Information-Theoretic Properties of
Languages and their Grammars
Bruce  J.  MacLennan 
Computer  Science  Department 
Naval  Postgraduate  School 
Monterey,  CA  93943 

Abstract: We describe means for computing a number of information-theoretic properties
of languages and their grammars. For example, the entropy of a system of symbols
is widely recognized as a measure of that system's complexity and organization.
We show how the entropy of a language can be computed in a simple way from a grammar
annotated with production probabilities. We then develop means for statistically
estimating these production probabilities from measurable properties of strings in the
language. We also consider the computation of other information-theoretic properties
of languages and grammars, such as the average information borne by a symbol in a
language and the average information used by the productions of a grammar.


1.  Introduction 

The entropy of a system is widely recognized as a measure (actually, a reciprocal
measure) of that system's organization and structure [Shannon, Brillouin, Hamming,
McKay, Cherry]. This suggests that the entropy of a language might be an important
property to measure as a basis for the quantitative comparison of languages. For
this reason we have developed means for computing the entropies of languages.
Specifically, we derive formulas for computing the entropy of a language from a
grammar for that language that has been annotated with the probabilities of its productions
being applied. We also show techniques whereby these production probabilities can be
inferred from statistical properties of strings in the language. Finally, we apply the
same techniques to several related issues, such as determining the average derivation
length of a grammar, and the average information consumed by a grammar during
string generation. These all seem to show promise as means for making quantitative
language comparisons.

2. Entropy of a Language


2.1 Definition of Entropy

Suppose $\Sigma$ is a finite system of symbols $\sigma_1, \sigma_2, \ldots, \sigma_k$, in which
symbol $\sigma_i$ has a priori probability of occurrence $p_i$. Naturally,
$\sum_{i=1}^{k} p_i = 1$. The entropy of $\Sigma$ is defined

$$H(\Sigma) = \sum_{i=1}^{k} p_i \lg(1/p_i),$$

where $\lg x = \log_2 x$. Since the entropy does not depend on the symbols $\sigma_i$, and is
completely determined by the probabilities $p_i$, it is simpler to define the entropy in terms
of the a priori probability distribution. The entropy of the finite discrete probability
distribution $p_1, p_2, \ldots, p_k$ is

$$H(p_1, p_2, \ldots, p_k) = \sum_{i=1}^{k} p_i \lg(1/p_i) = -\sum_{i=1}^{k} p_i \lg p_i.$$

The preceding ideas are easily extended to infinite discrete probability distributions.
Suppose $\sum_{i=1}^{\infty} p_i = 1$. We define the entropy of this distribution:

$$H(p_1, p_2, \ldots) = \sum_{i=1}^{\infty} p_i \lg(1/p_i) = -\sum_{i=1}^{\infty} p_i \lg p_i.$$

Note that $\sum_i p_i = 1$ does not guarantee the convergence of $\sum_i p_i \lg p_i$. That is,
there are probability distributions that do not have an entropy. Take, for example,
$p_i = C/(i \ln^2 i)$ for $i \ge 2$. The sum $\sum_i p_i$ converges, but the entropy
$-\sum_i p_i \lg p_i$ does not. Fortunately, these troublesome distributions do not seem
to occur in practice.

Entropy is widely recognized as a measure of disorganization, and thus lack of
structure [Brillouin]. When organization increases, entropy decreases; when entropy
increases, structure decreases. Thus it is usually more convenient to work with
negentropy rather than entropy. The negentropy of a system is simply the negative of its
entropy. Thus, when organization increases, so does negentropy; when negentropy
decreases, so does structure. The negentropy $\bar{H}$ of a discrete distribution $p_i$ is defined

$$\bar{H}(p_i) = -H(p_i) = \sum_i p_i \lg p_i.$$
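
The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `entropy` and `negentropy` and the example distributions are assumptions introduced here, not part of the original text.

```python
import math

def entropy(probs):
    """Entropy in bits: sum of p * lg(1/p), skipping zero-probability terms."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def negentropy(probs):
    """Negentropy: the negative of the entropy, i.e. the sum of p * lg p."""
    return -entropy(probs)

fair_coin = [0.5, 0.5]      # hypothetical two-symbol system
biased_coin = [0.9, 0.1]    # a more "organized" system

print(entropy(fair_coin), negentropy(fair_coin))
print(entropy(biased_coin), negentropy(biased_coin))
```

The biased distribution has the lower entropy (higher negentropy), which matches the reading of negentropy as a measure of organization.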

2.2  The  Entropy  of  a  Language 

A language $\Sigma$ is a (usually infinite) set of strings

$$\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_i, \ldots\}.$$

Now, let $P(\sigma_i)$ be the a priori probability of occurrence of the string $\sigma_i$ in
$\Sigma$. The negentropy of the language $\Sigma$ is simply

$$\bar{H}(\Sigma) = \sum_{\sigma \in \Sigma} P(\sigma) \lg P(\sigma).$$

In most interesting cases the number of strings in a language is infinite. The entropy of
an infinite language is thus defined in terms of an infinite set of probabilities.
Therefore for most languages we are able to calculate the entropy only when there is some
finite description for that infinite set of probabilities, that is, when there is some
structure in that infinite set of probabilities.

Although useful languages are usually infinite (i.e., comprise an infinite number of
strings), they can be described finitely by a grammar. That is, the grammar reflects
the finite structure in the infinite set of strings. This suggests a solution to the
problem of finding a finite description of the infinity of probabilities associated with the
strings in the language.

The generation of each string in a language requires a finite number of elementary
choices to be made. For example, in a grammar for arithmetic expressions there
might be two productions for a nonterminal $\nu$:

$$\nu \to +$$

$$\nu \to -$$

In deriving a string from this grammar, the symbol $\nu$ can be replaced by either $+$ or
$-$; a choice must be made. Thus, a finite sequence of choices $\pi_1, \pi_2, \ldots, \pi_k$ is
necessary to determine each string in the language. Conversely, if the grammar is
unambiguous, there is a unique such sequence for each string in the language (see footnote 1).


Now suppose that each elementary choice $\pi_i$ has an a priori probability $P(\pi_i)$ of
being made. If these choices are independent, then the probability of the resulting
string being generated is

$$P(\pi_1)\, P(\pi_2) \cdots P(\pi_k).$$

Thus, associating a probability with each elementary choice permitted by the grammar
induces a probability on each string generated by the grammar. We call a grammar
with such associated probabilities an annotated grammar.
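
As a small illustration of how choice probabilities induce a string probability, the following Python sketch multiplies the probabilities of the choices made in one derivation. The derivation (a sign, two digits, and the decisions to continue or stop) and all of its probability values are hypothetical, chosen only for this example.

```python
import math

def string_probability(choice_probs):
    """Probability of a string: the product of its independent choice probabilities."""
    return math.prod(choice_probs)

# Hypothetical derivation of the string "+42":
choices = [
    0.25,  # choose the sign '+'
    0.10,  # choose the digit '4'
    0.50,  # choose to continue with another digit
    0.10,  # choose the digit '2'
    0.50,  # choose to stop
]
print(string_probability(choices))
```

Because an unambiguous grammar gives each string exactly one such sequence of choices, this product is the entire probability of the string under the annotated grammar.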

There  is  of  course  no  guarantee  that  the  probabilities  induced  by  an  annotated 
grammar  are  in  fact  the  a  priori  occurrence  probabilities  of  the  strings  in  the 
language.  Indeed,  an  annotated  grammar  is  a  model  of  the  processes  that  in  reality 
generate  strings  in  the  language.  As  such,  it  might  or  might  not  be  a  good  model. 

We  say  that  an  annotated  grammar  predicts  a  language  if  it  generates  that  language 
and  induces  on  its  strings  their  actual  a  priori  probabilities  of  occurrence.  We  call  a 
language  predictable  if  there  is  an  annotated  grammar  that  predicts  it.  Clearly  then, 
we  can  determine  the  probabilities  of  the  strings  in  a  predictable  language  if  we  can 
find  an  annotated  grammar  that  predicts  that  language.  Further,  if  we  can  calculate 
the  entropy  of  the  language  generated  by  an  annotated  grammar,  then  we  will  be  able 
to  calculate  the  entropy  of  the  predictable  language.  In  the  following  sections  we 
develop  means  for  computing  entropies  from  annotated  grammars. 

Footnote 1: More precisely, in an unambiguous grammar there is a unique leftmost derivation for each string.

3. Computing Negentropy from Grammars
3.1 Annotated Regular Grammars

We begin our analysis with a particularly simple class of languages: regular languages
[Ginsburg, Hopcroft & Ullman]. The advantage of beginning with them is that the
grammar for a regular language can be written as a single nonrecursive production
making use of only a few simple operators. These operators are:

name           notation        interpretation
catenation     $AB$            an $A$ followed by a $B$
alternation    $A \oplus B$    an $A$ or a $B$
Kleene star    $A^*$           zero or more $A$s
Kleene cross   $A^+$           one or more $A$s

Any regular language can be described by an expression formed from the empty string
($\varepsilon$), individual tokens and these operators, appropriately parenthesized (see
footnote 2). Such an expression, which defines a regular language, is called a regular expression.

For example, the regular language of signed, nonnull digit strings is defined by the
regular expression:

$$(+ \oplus - \oplus \varepsilon)(0 \oplus 1 \oplus 2 \oplus 3 \oplus 4 \oplus 5 \oplus 6 \oplus 7 \oplus 8 \oplus 9)^+$$

This expression can be read, "a plus or a minus or nothing, followed by a string of one
or more digits."

As discussed in Section 2, to compute the entropy of a language it is necessary to
know the probabilities of the strings in that language. If we have an annotated grammar
that predicts the language that it generates, then the probabilities of these strings
can be computed from the production probabilities (choice probabilities) in the grammar.

In deriving a string from a regular expression there is only one situation in which a
choice can be made: from $A \oplus B$ we can derive either an $A$ or a $B$. Thus we can
annotate a regular expression by associating probabilities with all the alternands of an
alternation. We write the probabilities immediately preceding the alternands that they
are associated with:

$$pA \oplus \bar{p}B.$$

This means that we can choose an $A$ with probability $p$, or a $B$ with probability $\bar{p}$.
(Here, and throughout this paper, we use $\bar{p}$ as an abbreviation for $1-p$.)

Since one of the alternands must be chosen, their probabilities must add to unity.
This is the case above, since $p + \bar{p} = 1$. It also applies if there are more than two
alternands. For example, if we have

$$p_1 A_1 \oplus p_2 A_2 \oplus \cdots \oplus p_n A_n,$$

then we must have $p_1 + p_2 + \cdots + p_n = 1$.

Footnote 2: We have used '$\oplus$' instead of the usual '|', since the latter could be confused with
conditional probabilities, conditional entropies, etc. Readers unfamiliar with the regular languages
and other concepts from formal language theory should consult any standard text on the subject
(e.g., Ginsburg or Hopcroft & Ullman).

The  following  sections  develop  entropy  formulas  that  can  be  recursively  applied  to 
any  annotated  regular  expression  to  compute  the  entropy  of  the  regular  language 
predicted  by  that  expression.  When  these  results  have  been  obtained  we  will  show  that 
they  can  be  easily  extended  to  the  computation  of  the  entropy  of  any  context-free 
language  from  a  grammar  that  predicts  it. 

3.2  Entropy  Formulas 

We  derive  a  series  of  formulas  that  can  be  applied  recursively  to  an  annotated  regular 
expression  to  compute  the  entropy  of  the  language  predicted  by  that  expression.  In  all 
cases  we  assume  that  the  regular  expression  is  an  unambiguous  grammar  (i.e.,  there  is 
only  one  way  to  generate  a  given  string),  and  that  the  choices  leading  to  a  given  string 
are  independent. 

We  begin  with  the  simplest  regular  expressions,  the  empty  string  and  individual 
tokens,  and  proceed  to  the  catenation  and  alternation  operations. 

Theorem: If $\varepsilon$ is the set containing just the empty string and $\tau$ the set containing
just the individual symbol $\tau$ then

$$\bar{H}(\varepsilon) = \bar{H}(\tau) = 0.$$

Proof: By definition $L(\varepsilon) = \{\varepsilon\}$ and $L(\tau) = \{\tau\}$. Since there is only
one string in each of these languages, its a priori probability is 1. Hence,

$$\bar{H}(\varepsilon) = \bar{H}(\tau) = 1 \lg 1 = 0.$$

Q.E.D.

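Theorem: $\bar{H}(AB) = \bar{H}(A) + \bar{H}(B)$.

Proof: Suppose

$$L(A) = \{\alpha_1, \alpha_2, \ldots\},$$

$$L(B) = \{\beta_1, \beta_2, \ldots\}.$$

Then

$$L(AB) = \{\alpha_i \beta_j \mid \alpha_i \in L(A),\ \beta_j \in L(B)\}.$$

Let $P_A(\alpha_i)$ be the probability of choosing $\alpha_i$ from $L(A)$ and $P_B(\beta_j)$ the
probability of choosing $\beta_j$ from $L(B)$. Then, since we are assuming these choices are
independent, the probability of choosing $\alpha_i\beta_j$ from $L(AB)$, $P_{AB}(\alpha_i\beta_j)$, is just
$P_A(\alpha_i) P_B(\beta_j)$. Now, let $p_i = P_A(\alpha_i)$ and $q_j = P_B(\beta_j)$. Then, by
factoring and distributing:

$$\begin{aligned}
\bar{H}(AB) &= \sum_{i,j} P_{AB}(\alpha_i\beta_j) \lg P_{AB}(\alpha_i\beta_j) \\
&= \sum_i \sum_j P_A(\alpha_i) P_B(\beta_j) \lg [P_A(\alpha_i) P_B(\beta_j)] \\
&= \sum_i \sum_j p_i q_j \lg (p_i q_j) \\
&= \sum_i p_i \Big[ \sum_j q_j \lg (p_i q_j) \Big] \\
&= \sum_i p_i \Big[ \sum_j q_j (\lg p_i + \lg q_j) \Big] \\
&= \sum_i p_i \Big[ \sum_j q_j \lg p_i + \sum_j q_j \lg q_j \Big].
\end{aligned}$$

Now, since $\sum_j q_j = 1$ and $\bar{H}(B) = \sum_j q_j \lg q_j$,

$$\begin{aligned}
\bar{H}(AB) &= \sum_i p_i [\lg p_i + \bar{H}(B)] \\
&= \sum_i p_i \lg p_i + \sum_i p_i \bar{H}(B).
\end{aligned}$$

Since $\sum_i p_i = 1$ and $\bar{H}(A) = \sum_i p_i \lg p_i$, we have

$$\bar{H}(AB) = \bar{H}(A) + \bar{H}(B).$$

Q.E.D.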
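
The catenation result can be checked numerically. The sketch below builds the product distribution for two small, hypothetical string distributions (the values in `pa` and `pb` are made up for illustration) and compares the two sides of the theorem; the helper names are assumptions of this example.

```python
import math
from itertools import product

def negentropy(probs):
    """Sum of p * lg p over a discrete distribution."""
    return sum(p * math.log2(p) for p in probs if p > 0)

def catenation_distribution(pa, pb):
    """String probabilities for L(AB), assuming independent choices and unique factorization."""
    return [p * q for p, q in product(pa, pb)]

pa = [0.5, 0.3, 0.2]   # hypothetical probabilities of the strings in L(A)
pb = [0.6, 0.4]        # hypothetical probabilities of the strings in L(B)

lhs = negentropy(catenation_distribution(pa, pb))
rhs = negentropy(pa) + negentropy(pb)
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding, as the theorem requires.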

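Definition: The n-fold catenation of $A$ with itself, $A^n$, is defined:

$$A^0 = \varepsilon,$$

$$A^1 = A,$$

$$A^{n+1} = A A^n, \text{ for } n > 0.$$

Corollary: $\bar{H}(A^n) = n\bar{H}(A)$.

Proof: We prove the result inductively.

$$\bar{H}(A^0) = \bar{H}(\varepsilon) = 0 = 0\,\bar{H}(A).$$

Similarly,

$$\bar{H}(A^1) = \bar{H}(A) = 1\,\bar{H}(A).$$

Proceeding inductively for $n > 0$,

$$\begin{aligned}
\bar{H}(A^{n+1}) &= \bar{H}(A A^n) \\
&= \bar{H}(A) + \bar{H}(A^n) \\
&= \bar{H}(A) + n\bar{H}(A) \\
&= (n+1)\bar{H}(A).
\end{aligned}$$

Q.E.D.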

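Theorem: $\bar{H}\{pA \oplus \bar{p}B\} = \bar{H}(p,\bar{p}) + p\bar{H}(A) + \bar{p}\bar{H}(B)$.

Proof: Let $G = pA \oplus \bar{p}B$. Then, to generate a string in $L(G)$ we must make a choice;
with probability $p$ we pick a string from $L(A)$, with probability $\bar{p}$ we pick a string from
$L(B)$. Let $\sigma$ be a string in $L(G)$. Since we are only considering unambiguous grammars,
$\sigma$ must have come from either $L(A)$ or $L(B)$, but not both. Suppose that
$\sigma \in L(A)$. Since the probability of a selection from $L(A)$ is $p$, and the probability of
getting $\sigma$ when a selection is made from $L(A)$ is $P_A(\sigma)$, the probability $P_G(\sigma)$
of selecting $\sigma$ from $L(G)$ is $pP_A(\sigma)$. Similarly, if $\sigma \in L(B)$ then
$P_G(\sigma) = \bar{p}P_B(\sigma)$. These observations allow us to compute the entropy of $G$.
As before, we write $p_i = P_A(\alpha_i)$ and $q_j = P_B(\beta_j)$.

$$\begin{aligned}
\bar{H}(G) &= \sum_{\sigma \in L(G)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_{\sigma \in L(A)} P_G(\sigma) \lg P_G(\sigma) + \sum_{\sigma \in L(B)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_i P_G(\alpha_i) \lg P_G(\alpha_i) + \sum_j P_G(\beta_j) \lg P_G(\beta_j) \\
&= \sum_i pP_A(\alpha_i) \lg pP_A(\alpha_i) + \sum_j \bar{p}P_B(\beta_j) \lg \bar{p}P_B(\beta_j) \\
&= \sum_i p p_i \lg p p_i + \sum_j \bar{p} q_j \lg \bar{p} q_j \\
&= p \sum_i p_i \lg p p_i + \bar{p} \sum_j q_j \lg \bar{p} q_j \\
&= p \sum_i p_i (\lg p + \lg p_i) + \bar{p} \sum_j q_j (\lg \bar{p} + \lg q_j) \\
&= p \Big[ \sum_i p_i \lg p + \sum_i p_i \lg p_i \Big] + \bar{p} \Big[ \sum_j q_j \lg \bar{p} + \sum_j q_j \lg q_j \Big].
\end{aligned}$$

From the definitions of $\bar{H}(A)$ and $\bar{H}(B)$ and the fact that the $p_i$ and $q_j$ sum to 1 we get

$$\begin{aligned}
\bar{H}(G) &= p[\lg p + \bar{H}(A)] + \bar{p}[\lg \bar{p} + \bar{H}(B)] \\
&= p \lg p + \bar{p} \lg \bar{p} + p\bar{H}(A) + \bar{p}\bar{H}(B) \\
&= \bar{H}(p,\bar{p}) + p\bar{H}(A) + \bar{p}\bar{H}(B).
\end{aligned}$$

Q.E.D.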

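This result is easy to generalize to the n-fold alternation:

Theorem: The negentropy of an n-fold alternation can be computed:

$$\bar{H}\{p_1 A_1 \oplus p_2 A_2 \oplus \cdots \oplus p_n A_n\} = \bar{H}(p_1, p_2, \ldots, p_n) + \sum_{j=1}^{n} p_j \bar{H}(A_j).$$

Proof: Let $G = p_1 A_1 \oplus p_2 A_2 \oplus \cdots \oplus p_n A_n$. For each $\alpha_{i,j} \in L(A_j)$ let
$q_{i,j} = P_{A_j}(\alpha_{i,j})$. The proof is a simple generalization of the previous:

$$\begin{aligned}
\bar{H}(G) &= \sum_{\sigma \in L(G)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_{j=1}^{n} \sum_{\sigma \in L(A_j)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_{j=1}^{n} \sum_i p_j q_{i,j} \lg p_j q_{i,j} \\
&= \sum_{j=1}^{n} p_j \Big[ \sum_i q_{i,j} \lg p_j q_{i,j} \Big] \\
&= \sum_{j=1}^{n} p_j \Big[ \sum_i q_{i,j} (\lg p_j + \lg q_{i,j}) \Big] \\
&= \sum_{j=1}^{n} p_j \Big[ \sum_i q_{i,j} \lg p_j + \sum_i q_{i,j} \lg q_{i,j} \Big].
\end{aligned}$$

Since $\sum_i q_{i,j} = 1$, $\bar{H}(p_1, p_2, \ldots, p_n) = \sum_{j=1}^{n} p_j \lg p_j$ and
$\bar{H}(A_j) = \sum_i q_{i,j} \lg q_{i,j}$,

$$\begin{aligned}
\bar{H}(G) &= \sum_{j=1}^{n} p_j [\lg p_j + \bar{H}(A_j)] \\
&= \sum_{j=1}^{n} p_j \lg p_j + \sum_{j=1}^{n} p_j \bar{H}(A_j) \\
&= \bar{H}(p_1, p_2, \ldots, p_n) + \sum_{j=1}^{n} p_j \bar{H}(A_j).
\end{aligned}$$

Q.E.D.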
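
The alternation formula can also be checked numerically. In the sketch below, each alternand is represented only by the list of its string probabilities; the branch probabilities and string distributions are hypothetical values, and the languages are assumed disjoint (the unambiguity assumption of the theorem).

```python
import math

def negentropy(probs):
    """Sum of p * lg p over a discrete distribution."""
    return sum(p * math.log2(p) for p in probs if p > 0)

def alternation_distribution(branch_probs, branch_dists):
    """String probabilities of p1 A1 (+) ... (+) pn An for disjoint alternands."""
    return [p * q for p, dist in zip(branch_probs, branch_dists) for q in dist]

branch_probs = [0.5, 0.3, 0.2]                        # hypothetical p_1, p_2, p_3
branch_dists = [[0.7, 0.3], [1.0], [0.25, 0.25, 0.5]]  # hypothetical q_{i,j}

lhs = negentropy(alternation_distribution(branch_probs, branch_dists))
rhs = negentropy(branch_probs) + sum(
    p * negentropy(dist) for p, dist in zip(branch_probs, branch_dists)
)
print(lhs, rhs)
```

With two branches this reduces to the two-way alternation theorem proved earlier.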

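The $\oplus$ operation is associative, that is,

$$A \oplus (B \oplus C) = A \oplus B \oplus C = (A \oplus B) \oplus C.$$

We would expect the annotated version of this operation to also be associative:

$$pA \oplus \bar{p}(qB \oplus \bar{q}C) = pA \oplus \bar{p}qB \oplus \bar{p}\bar{q}C.$$

Thus, if our negentropy formula is correct, we should get the same value for the
negentropy of each of these regular expressions.

Theorem: $\bar{H}\{pA \oplus \bar{p}(qB \oplus \bar{q}C)\} = \bar{H}\{pA \oplus \bar{p}qB \oplus \bar{p}\bar{q}C\}$.

Proof: We derive the negentropy of the right-hand expression:

$$\bar{H}\{pA \oplus \bar{p}qB \oplus \bar{p}\bar{q}C\} = \bar{H}(p, \bar{p}q, \bar{p}\bar{q}) + p\bar{H}(A) + \bar{p}q\bar{H}(B) + \bar{p}\bar{q}\bar{H}(C).$$

Next we derive the negentropy of the left-hand side and show it equals the expression
above:

$$\begin{aligned}
\bar{H}\{pA \oplus \bar{p}(qB \oplus \bar{q}C)\}
&= \bar{H}(p,\bar{p}) + p\bar{H}(A) + \bar{p}\bar{H}\{qB \oplus \bar{q}C\} \\
&= \bar{H}(p,\bar{p}) + p\bar{H}(A) + \bar{p}[\bar{H}(q,\bar{q}) + q\bar{H}(B) + \bar{q}\bar{H}(C)] \\
&= \bar{H}(p,\bar{p}) + p\bar{H}(A) + \bar{p}\bar{H}(q,\bar{q}) + \bar{p}q\bar{H}(B) + \bar{p}\bar{q}\bar{H}(C) \\
&= \bar{H}(p,\bar{p}) + \bar{p}\bar{H}(q,\bar{q}) + p\bar{H}(A) + \bar{p}q\bar{H}(B) + \bar{p}\bar{q}\bar{H}(C).
\end{aligned}$$

Thus it remains to show that

$$\bar{H}(p, \bar{p}q, \bar{p}\bar{q}) = \bar{H}(p,\bar{p}) + \bar{p}\bar{H}(q,\bar{q}).$$

Expanding the right-hand side above, rearranging, and recalling that $q + \bar{q} = 1$, we get

$$\begin{aligned}
\bar{H}(p,\bar{p}) + \bar{p}\bar{H}(q,\bar{q})
&= p \lg p + \bar{p} \lg \bar{p} + \bar{p}q \lg q + \bar{p}\bar{q} \lg \bar{q} \\
&= p \lg p + \bar{p}(q + \bar{q}) \lg \bar{p} + \bar{p}q \lg q + \bar{p}\bar{q} \lg \bar{q} \\
&= p \lg p + \bar{p}q \lg \bar{p} + \bar{p}\bar{q} \lg \bar{p} + \bar{p}q \lg q + \bar{p}\bar{q} \lg \bar{q} \\
&= p \lg p + \bar{p}q (\lg \bar{p} + \lg q) + \bar{p}\bar{q} (\lg \bar{p} + \lg \bar{q}) \\
&= p \lg p + \bar{p}q \lg \bar{p}q + \bar{p}\bar{q} \lg \bar{p}\bar{q} \\
&= \bar{H}(p, \bar{p}q, \bar{p}\bar{q}).
\end{aligned}$$

Q.E.D.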

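We now consider the iterative constructs in regular grammars. The Kleene cross,
$A^+$, means one or more $A$s. Thus $A^+$ can be expanded as the infinite alternation:

$$A^+ = A \oplus A^2 \oplus A^3 \oplus \cdots$$

It can also be defined by the recursive formula:

$$A^+ = A \oplus AA^+.$$

This kind of regular grammar is converted to an annotated grammar by adding a
continuation probability $p$:

$$A_p^+ = \bar{p}A \oplus p\bar{p}A^2 \oplus p^2\bar{p}A^3 \oplus \cdots$$

or in its recursive form

$$A_p^+ = \bar{p}A \oplus pAA_p^+.$$

We will derive the negentropy formula two ways, using both the infinite alternation and
recursive definitions, and show that we get the same result.

Theorem: $\bar{H}\{A_p^+\} = [\bar{H}(p,\bar{p}) + \bar{H}(A)]/\bar{p}$.

Proof: First we use the recursive formulation:

$$A_p^+ = \bar{p}A \oplus pAA_p^+.$$

Taking the negentropy of both sides we have:

$$\begin{aligned}
\bar{H}\{A_p^+\} &= \bar{H}\{\bar{p}A \oplus pAA_p^+\} \\
&= \bar{H}(p,\bar{p}) + \bar{p}\bar{H}(A) + p\bar{H}\{AA_p^+\} \\
&= \bar{H}(p,\bar{p}) + \bar{p}\bar{H}(A) + p[\bar{H}(A) + \bar{H}\{A_p^+\}] \\
&= \bar{H}(p,\bar{p}) + \bar{p}\bar{H}(A) + p\bar{H}(A) + p\bar{H}\{A_p^+\} \\
&= \bar{H}(p,\bar{p}) + \bar{H}(A) + p\bar{H}\{A_p^+\}.
\end{aligned}$$

Solving now for $\bar{H}\{A_p^+\}$:

$$(1-p)\bar{H}\{A_p^+\} = \bar{H}(p,\bar{p}) + \bar{H}(A).$$

Hence,

$$\bar{H}\{A_p^+\} = [\bar{H}(p,\bar{p}) + \bar{H}(A)]/\bar{p}.$$

Q.E.D.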

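Next we compute the negentropy directly from the infinite expansion of the iteration:

$$\begin{aligned}
\bar{H}\{A_p^+\} &= \bar{H}\{\bar{p}A \oplus p\bar{p}A^2 \oplus p^2\bar{p}A^3 \oplus \cdots\} \\
&= \bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}\bar{H}(A) + p\bar{p}\bar{H}(A^2) + p^2\bar{p}\bar{H}(A^3) + \cdots.
\end{aligned}$$

Recalling that $\bar{H}(A^n) = n\bar{H}(A)$,

$$\begin{aligned}
\bar{H}\{A_p^+\} &= \bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}[\bar{H}(A) + 2p\bar{H}(A) + 3p^2\bar{H}(A) + \cdots] \\
&= \bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}[1 + 2p + 3p^2 + \cdots]\bar{H}(A).
\end{aligned}$$

Now note that if $|p| < 1$ the power series expansion of $1/\bar{p}^2$ is

$$1/\bar{p}^2 = \frac{1}{(1-p)^2} = 1 + 2p + 3p^2 + \cdots = \sum_{k=1}^{\infty} k p^{k-1}.$$

Therefore

$$\begin{aligned}
\bar{H}\{A_p^+\} &= \bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}(1/\bar{p}^2)\bar{H}(A) \\
&= \bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{H}(A)/\bar{p}.
\end{aligned}$$

It remains to simplify $\bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots)$.

$$\begin{aligned}
\bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) &= \bar{p} \lg \bar{p} + p\bar{p} \lg p\bar{p} + p^2\bar{p} \lg p^2\bar{p} + \cdots \\
&= \sum_{k=0}^{\infty} \bar{p}p^k \lg \bar{p}p^k \\
&= \bar{p} \sum_{k=0}^{\infty} p^k \lg p^k\bar{p} \\
&= \bar{p} \sum_{k=0}^{\infty} p^k [\lg p^k + \lg \bar{p}] \\
&= \bar{p} \Big[ \sum_{k=0}^{\infty} p^k \lg p^k + \sum_{k=0}^{\infty} p^k \lg \bar{p} \Big].
\end{aligned}$$

Now note that if $|p| < 1$ the power series expansion of $1/\bar{p}$ is

$$1/\bar{p} = \frac{1}{1-p} = 1 + p + p^2 + p^3 + \cdots = \sum_{k=0}^{\infty} p^k.$$

Therefore,

$$\begin{aligned}
\bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) &= \bar{p} \Big[ \sum_{k=0}^{\infty} k p^k \lg p + (1/\bar{p}) \lg \bar{p} \Big] \\
&= \Big[ \bar{p} \lg p \sum_{k=0}^{\infty} k p^k \Big] + \lg \bar{p} \\
&= \Big[ \bar{p}\, p \lg p \sum_{k=1}^{\infty} k p^{k-1} \Big] + \lg \bar{p}.
\end{aligned}$$

Using again the power series expansion for $1/\bar{p}^2$ we have

$$\begin{aligned}
\bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) &= \bar{p}\, p \lg p\, (1/\bar{p}^2) + \lg \bar{p} \\
&= (p \lg p)/\bar{p} + \lg \bar{p} \\
&= (p \lg p + \bar{p} \lg \bar{p})/\bar{p} \\
&= \bar{H}(p,\bar{p})/\bar{p}.
\end{aligned}$$

Therefore,

$$\bar{H}\{A_p^+\} = \bar{H}(p,\bar{p})/\bar{p} + \bar{H}(A)/\bar{p} = [\bar{H}(p,\bar{p}) + \bar{H}(A)]/\bar{p}.$$

Q.E.D.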
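
The closed form for the negentropy of the geometric distribution that arises in this derivation can be checked by summing a truncated series. The following sketch is an illustration only; the function names, the truncation length, and the test value of the continuation probability are assumptions of this example.

```python
import math

def negentropy(probs):
    """Sum of p * lg p over a discrete distribution."""
    return sum(p * math.log2(p) for p in probs if p > 0)

def geometric_negentropy(p, terms=5000):
    """Truncated sum over k of w_k * lg w_k, where w_k = (1 - p) * p**k."""
    total = 0.0
    for k in range(terms):
        w = (1 - p) * p ** k
        if w > 0:                 # skip terms that underflow to zero
            total += w * math.log2(w)
    return total

p = 0.7                            # hypothetical continuation probability
series = geometric_negentropy(p)
closed_form = negentropy([p, 1 - p]) / (1 - p)
print(series, closed_form)
```

For continuation probabilities close to 1 the series converges slowly, so more terms would be needed for the same accuracy.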

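The Kleene star, $A^*$, means zero or more repetitions. Thus it can be defined by the
infinite expansion

$$A^* = A^0 \oplus A^1 \oplus A^2 \oplus \cdots$$

where $A^0 = \varepsilon$ and $A^1 = A$. Since the expression following the first $\oplus$ is just the
definition of $A^+$, the above equation can be written

$$A^* = \varepsilon \oplus A^+.$$

The Kleene star can also be defined recursively:

$$A^* = \varepsilon \oplus AA^*.$$

This notation is annotated by attaching a continuation probability $p$ to the star:

$$A_p^* = \bar{p}\varepsilon \oplus pAA_p^*.$$

The following theorem defines its negentropy.

Theorem: $\bar{H}\{A_p^*\} = [\bar{H}(p,\bar{p}) + p\bar{H}(A)]/\bar{p}$.

Proof: There are several ways to prove this result, corresponding to the alternate
definitions of $A^*$.

(1) First we derive the negentropy of $A_p^*$ from the negentropy of $A_p^+$. Since

$$A_p^* = \bar{p}\varepsilon \oplus pA_p^+,$$

we can apply $\bar{H}$ to both sides:

$$\begin{aligned}
\bar{H}\{A_p^*\} &= \bar{H}\{\bar{p}\varepsilon \oplus pA_p^+\} \\
&= \bar{H}(p,\bar{p}) + \bar{p}\bar{H}(\varepsilon) + p\bar{H}\{A_p^+\} \\
&= \bar{H}(p,\bar{p}) + p[\bar{H}(p,\bar{p}) + \bar{H}(A)]/\bar{p} \\
&= [\bar{p}\bar{H}(p,\bar{p}) + p\bar{H}(p,\bar{p}) + p\bar{H}(A)]/\bar{p} \\
&= [\bar{H}(p,\bar{p}) + p\bar{H}(A)]/\bar{p}.
\end{aligned}$$

Q.E.D.

(2) We can also compute the negentropy from the recursive equation

$$A_p^* = \bar{p}\varepsilon \oplus pAA_p^*.$$

We apply $\bar{H}$ to both sides and get

$$\begin{aligned}
\bar{H}\{A_p^*\} &= \bar{H}\{\bar{p}\varepsilon \oplus pAA_p^*\} \\
&= \bar{H}(p,\bar{p}) + \bar{p}\bar{H}(\varepsilon) + p\bar{H}\{AA_p^*\} \\
&= \bar{H}(p,\bar{p}) + \bar{p} \cdot 0 + p[\bar{H}(A) + \bar{H}\{A_p^*\}] \\
&= \bar{H}(p,\bar{p}) + p\bar{H}(A) + p\bar{H}\{A_p^*\}.
\end{aligned}$$

Grouping the unknowns on the left produces

$$(1-p)\bar{H}\{A_p^*\} = \bar{H}(p,\bar{p}) + p\bar{H}(A).$$

Recalling that $\bar{p} = 1-p$ we have

$$\bar{H}\{A_p^*\} = [\bar{H}(p,\bar{p}) + p\bar{H}(A)]/\bar{p}.$$

Q.E.D.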

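(3) Finally, we derive the negentropy formula from the infinite expansion

$$A_p^* = \bar{p}\varepsilon \oplus p\bar{p}A \oplus p^2\bar{p}A^2 \oplus \cdots$$

Take the negentropy of both sides to get

$$\begin{aligned}
\bar{H}\{A_p^*\} &= \bar{H}\{\bar{p}\varepsilon \oplus p\bar{p}A \oplus p^2\bar{p}A^2 \oplus p^3\bar{p}A^3 \oplus \cdots\} \\
&= \bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, p^3\bar{p}, \ldots) + p\bar{p}\bar{H}(A) + p^2\bar{p}\bar{H}(A^2) + p^3\bar{p}\bar{H}(A^3) + \cdots \\
&= \bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, p^3\bar{p}, \ldots) + p\bar{p}(1 + 2p + 3p^2 + \cdots)\bar{H}(A) \\
&= \bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, p^3\bar{p}, \ldots) + (p/\bar{p})\bar{H}(A),
\end{aligned}$$

where we have used $1/\bar{p}^2 = 1 + 2p + 3p^2 + \cdots$. We have already shown in the
derivation of $\bar{H}\{A_p^+\}$ that

$$\bar{H}(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) = \bar{H}(p,\bar{p})/\bar{p}.$$

Therefore we have

$$\begin{aligned}
\bar{H}\{A_p^*\} &= \bar{H}(p,\bar{p})/\bar{p} + (p/\bar{p})\bar{H}(A) \\
&= [\bar{H}(p,\bar{p}) + p\bar{H}(A)]/\bar{p}.
\end{aligned}$$

Q.E.D.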

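We can check these results by computing the negentropy of $A_p^+$ based on the equation:

$$A_p^+ = AA_p^*.$$

Applying $\bar{H}$ to both sides we derive

$$\begin{aligned}
\bar{H}\{A_p^+\} &= \bar{H}\{AA_p^*\} \\
&= \bar{H}(A) + \bar{H}\{A_p^*\} \\
&= \bar{H}(A) + [\bar{H}(p,\bar{p}) + p\bar{H}(A)]/\bar{p} \\
&= [\bar{H}(p,\bar{p}) + p\bar{H}(A) + \bar{p}\bar{H}(A)]/\bar{p} \\
&= [\bar{H}(p,\bar{p}) + \bar{H}(A)]/\bar{p},
\end{aligned}$$

which checks with our previous result.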

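The formulas for computing the negentropy (and hence entropy) of a regular
language are summarized in Table 1.

TABLE 1. Formulas for Negentropy of Regular Languages

$$\begin{aligned}
\bar{H}\{\varepsilon\} &= 0 \\
\bar{H}\{\tau\} &= 0 \\
\bar{H}\{AB\} &= \bar{H}(A) + \bar{H}(B) \\
\bar{H}\{pA \oplus \bar{p}B\} &= \bar{H}(p,\bar{p}) + p\bar{H}(A) + \bar{p}\bar{H}(B) \\
\bar{H}\{A_p^+\} &= [\bar{H}(p,\bar{p}) + \bar{H}(A)]/\bar{p} \\
\bar{H}\{A_p^*\} &= [\bar{H}(p,\bar{p}) + p\bar{H}(A)]/\bar{p}
\end{aligned}$$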
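
The formulas of Table 1 lend themselves to a recursive evaluator. The following Python sketch is one possible arrangement, not the author's implementation: the tuple encoding of expressions (`"eps"`, `"tok"`, `"cat"`, `"alt"`, `"plus"`, `"star"`), the function names, and the probabilities attached to the signed digit string example are all assumptions made for illustration.

```python
import math

def negentropy(probs):
    """Sum of p * lg p over a discrete distribution (the negative of its entropy)."""
    return sum(p * math.log2(p) for p in probs if p > 0)

def regex_negentropy(expr):
    """Apply the Table 1 formulas recursively to a tuple-encoded annotated expression."""
    kind = expr[0]
    if kind in ("eps", "tok"):             # empty string or a single token
        return 0.0
    if kind == "cat":                      # ("cat", A, B, ...)
        return sum(regex_negentropy(e) for e in expr[1:])
    if kind == "alt":                      # ("alt", [(p1, A1), (p2, A2), ...])
        branches = expr[1]
        probs = [p for p, _ in branches]
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("alternand probabilities must sum to 1")
        return negentropy(probs) + sum(p * regex_negentropy(e) for p, e in branches)
    if kind in ("plus", "star"):           # ("plus", p, A) or ("star", p, A)
        _, p, body = expr
        weight = 1.0 if kind == "plus" else p
        return (negentropy([p, 1 - p]) + weight * regex_negentropy(body)) / (1 - p)
    raise ValueError(f"unknown expression kind: {kind}")

# Signed, nonnull digit strings with hypothetical probabilities.
sign = ("alt", [(0.25, ("tok", "+")), (0.25, ("tok", "-")), (0.5, ("eps",))])
digit = ("alt", [(0.1, ("tok", str(d))) for d in range(10)])
signed_digits = ("cat", sign, ("plus", 0.5, digit))
print(regex_negentropy(signed_digits))
```

The evaluator assumes, as the theorems do, that the expression is unambiguous and that the continuation probability is strictly less than 1.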

3.3 Examples

In this section we illustrate the application of our negentropy formulas with several
simple examples. Several of these examples are based on free languages:

Definition: The free language on an alphabet $T$ is the set of all finite strings
(including the empty string) of elements of $T$.

Thus $T^*$ is the free language on $T$. In most cases it does not matter what the
alphabet $T$ is, so we speak of the free language on $n$ symbols. Let $T_n$ represent any
alphabet of $n$ symbols:

$$T_n = \tau_1 \oplus \tau_2 \oplus \cdots \oplus \tau_n.$$

Then $F_n$, the free language on $n$ symbols, is defined

$$F_n = T_n^* = (\tau_1 \oplus \tau_2 \oplus \cdots \oplus \tau_n)^*.$$

Of course, before we can compute the entropy of a language we must annotate its
grammar with probabilities. Therefore, the annotated grammar for the free language
on $n$ symbols is

$$F_n = (q_1\tau_1 \oplus q_2\tau_2 \oplus \cdots \oplus q_n\tau_n)_p^*.$$

Theorem: The negentropy of the free language on $n$ symbols,

$$F_n = (q_1\tau_1 \oplus \cdots \oplus q_n\tau_n)_p^*,$$

is

$$\bar{H}(F_n) = [\bar{H}(p,\bar{p}) + p\bar{H}(q_1, \ldots, q_n)]/\bar{p}.$$

Proof: We simply apply the formulas from Table 1:

$$\begin{aligned}
\bar{H}(F_n) &= \bar{H}\{(q_1\tau_1 \oplus \cdots \oplus q_n\tau_n)_p^*\} \\
&= [\bar{H}(p,\bar{p}) + p\bar{H}\{q_1\tau_1 \oplus \cdots \oplus q_n\tau_n\}]/\bar{p} \\
&= [\bar{H}(p,\bar{p}) + p\{\bar{H}(q_1, \ldots, q_n) + q_1\bar{H}(\tau_1) + \cdots + q_n\bar{H}(\tau_n)\}]/\bar{p} \\
&= [\bar{H}(p,\bar{p}) + p\bar{H}(q_1, \ldots, q_n)]/\bar{p}.
\end{aligned}$$

Q.E.D.


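The free language on one symbol $\tau$ is just the set of all strings of $\tau$s:

$$L(F_1) = \{\varepsilon, \tau, \tau\tau, \tau\tau\tau, \ldots\}.$$

The following corollary gives the negentropy of $F_1$.

Corollary: The negentropy of the free language with continuation probability $p$ on
one symbol is:

$$\bar{H}\{F_1\} = \bar{H}(p,\bar{p})/\bar{p}.$$

Proof: We simply use the previous theorem with $n = 1$, noting that $q_1 = 1$ and hence
$\bar{H}(q_1) = 1 \lg 1 = 0$.

Corollary: The negentropy of the free language with continuation probability $p$ on an
alphabet of $n$ equally likely symbols is

$$[\bar{H}(p,\bar{p}) - p \lg n]/\bar{p}.$$

Proof: To derive this simply set $q_i = 1/n$ in the negentropy formula for $F_n$:

$$\begin{aligned}
\bar{H}(F_n) &= [\bar{H}(p,\bar{p}) + p\bar{H}(q_1, \ldots, q_n)]/\bar{p} \\
&= [\bar{H}(p,\bar{p}) + p\bar{H}(1/n, \ldots, 1/n)]/\bar{p} \\
&= \Big[\bar{H}(p,\bar{p}) + p \sum_{i=1}^{n} (1/n) \lg (1/n)\Big]/\bar{p} \\
&= [\bar{H}(p,\bar{p}) + p \lg (1/n)]/\bar{p} \\
&= [\bar{H}(p,\bar{p}) - p \lg n]/\bar{p}.
\end{aligned}$$

Q.E.D.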


Table  2  shows  the  entropies  of  free  languages  on  equally  likely  symbols  for  several 
different  continuation  probabilities. 


TABLE 2. Entropies of Free Languages on Equally Likely Symbols

p \ n       2       4       8      10      12      64     256
0.1      0.63    0.74    0.85    0.89    0.92    1.19    1.41
0.2      1.15    1.40    1.65    1.73    1.80    2.40    2.90
0.3      1.69    2.12    2.54    2.68    2.80    3.83    4.69
0.4      2.28    2.95    3.62    3.83    4.01    5.62    6.95
0.5      3.00    4.00    5.00    5.32    5.58    8.00   10.00
0.6      3.93    5.43    6.93    7.41    7.80   11.43   14.43
0.7      5.27    7.60    9.91   10.69   11.30   16.94   21.60
0.8      7.61   11.61   15.61   16.90   17.95   27.61   35.61
0.9     13.69   22.69   31.69   34.59   36.95   58.69   76.69
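
The entries of Table 2 follow from the preceding corollary with the sign reversed (entropy rather than negentropy). A minimal Python sketch that regenerates the table is given below; the function names and the output layout are choices made for this illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of the two-outcome distribution (p, 1 - p)."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def free_language_entropy(p, n):
    """Entropy of the free language on n equally likely symbols: [H(p, 1-p) + p lg n] / (1 - p)."""
    return (binary_entropy(p) + p * math.log2(n)) / (1 - p)

sizes = [2, 4, 8, 10, 12, 64, 256]
for tenth in range(1, 10):
    p = tenth / 10
    row = " ".join(f"{free_language_entropy(p, n):6.2f}" for n in sizes)
    print(f"{p:.1f} {row}")
```

Each printed row corresponds to one continuation probability and each column to one alphabet size.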

This table suggests that we consider the special case in which $n$ is a power of two and $p$
is one half. This leads to:

Corollary: The negentropy of the free language with continuation probability one
half and $2^k$ equally likely symbols is $-k - 2$. Conversely, the entropy is $k + 2$.

Proof: Let $p = 1/2$ and $n = 2^k$ in the formula from the previous corollary and we have

$$\begin{aligned}
\bar{H} &= [\bar{H}(p,\bar{p}) - p \lg n]/\bar{p} \\
&= 2[\tfrac{1}{2} \lg \tfrac{1}{2} + \tfrac{1}{2} \lg \tfrac{1}{2} - k/2] \\
&= 2 \lg \tfrac{1}{2} - k \\
&= -2 - k.
\end{aligned}$$

Q.E.D.

3.4 Computing the Entropy of Context-Free Grammars

In this section we extend the results of the previous sections to the computation of the
entropy of any context-free grammar.

As usual, we define a context-free grammar $G$ to be a quadruple,

$$G = \langle T, N, P, \nu_0 \rangle,$$

in which $T$ is a finite set of terminal symbols, $N$ is a finite set of nonterminal symbols,
$\nu_0 \in N$ is the goal symbol, and $P$ is a finite set of productions,

$$P \subseteq N \times (T \cup N)^*.$$

That is, each production is a pair of the form $\langle \nu, \alpha \rangle$, in which $\nu$ is a nonterminal
and $\alpha$ is a finite string of terminals and nonterminals. Such a production is usually written
'$\nu \to \alpha$'. The Backus-Naur form (BNF) of a context-free grammar combines all the
productions for a given nonterminal into a single production. For example, if a
context-free grammar contains the following productions for $\nu$:

$$\nu \to \alpha_1$$

$$\nu \to \alpha_2$$

$$\vdots$$

$$\nu \to \alpha_n$$

then the BNF form of this grammar combines them into a single production:

$$\nu \to \alpha_1 \oplus \alpha_2 \oplus \cdots \oplus \alpha_n.$$

In the following discussion we will usually use the BNF form of grammars.

The characteristic that distinguishes context-free grammars from regular grammars
is that the productions of a context-free grammar can be mutually recursive.
That is, a nonterminal $\nu$ can be defined in terms of a string that is, directly or
indirectly, defined in terms of $\nu$. It is well known, however, (see Ginsburg) that each
production in a BNF grammar can be considered an equation on sets of strings. If we
recursively define $L(\alpha)$, the language defined by $\alpha$, as follows:

$$L(\varepsilon) = \{\varepsilon\}$$

$$L(\tau) = \{\tau\}$$

$$L(\alpha\beta) = L(\alpha) \times L(\beta), \text{ where } S \times T = \{\alpha\beta \mid \alpha \in S,\ \beta \in T\}$$

$$L(\alpha \oplus \beta) = L(\alpha) \cup L(\beta)$$

then each production $\nu \to \alpha$ of a context-free grammar $G$ can be transformed into a
corresponding equation

$$L(\nu) = L(\alpha).$$

Let $G = \langle T, N, P, \nu_0 \rangle$ be a context-free grammar, in which

$$P = \{\nu_0 \to \alpha_0,\ \nu_1 \to \alpha_1,\ \ldots,\ \nu_k \to \alpha_k\}$$

is a set of productions in BNF form. Then corresponding to $P$ is a collection of
simultaneous equations on sets of strings:

$$\begin{aligned}
L(\nu_0) &= L(\alpha_0) \\
L(\nu_1) &= L(\alpha_1) \\
&\ \ \vdots \\
L(\nu_k) &= L(\alpha_k)
\end{aligned}$$

The solution to this set of equations defines the language generated by $G$, since
$L(G) = L(\nu_0)$.

Context-free grammars can be annotated with production probabilities in the same
way as regular grammars. Formally, we define an annotated context-free grammar $G$
to be a quadruple $\langle T, N, P, \nu_0 \rangle$, in which $\nu_0 \in N$ and

$$P \subseteq \mathbb{R} \times N \times (T \cup N)^*.$$

Thus each production is a triple $\langle p, \nu, \alpha \rangle$, $p$ being a real number representing the
probability of applying the production $\nu \to \alpha$. We impose the restriction that all the
probabilities associated with the productions for a given nonterminal must sum to unity:

$$\sum_{\langle p_i, \nu, \alpha_i \rangle \in P} p_i = 1, \text{ for each } \nu \in N.$$

This is simpler to see in the BNF form of an annotated context-free grammar. In any
production that is an alternation,

$$\nu \to p_1\alpha_1 \oplus p_2\alpha_2 \oplus \cdots \oplus p_n\alpha_n,$$

we must have that

$$\sum_{i=1}^{n} p_i = 1.$$

Consider an unambiguous annotated context-free grammar $G$ and let $\Sigma = L(G)$ be the
language generated by $G$. Let $P_G(\sigma)$ be the probability that a string $\sigma$ is generated by
$G$. We say that $\Sigma$ is predicted by $G$ if for every string $\sigma$, $P_\Sigma(\sigma) = P_G(\sigma)$, that is, the
observed probability of occurrence of $\sigma$ is the same as the probability of its generation
by $G$. We now consider how we might compute the negentropy of $\Sigma$ from $G$.

Consider a production $\nu \to \alpha$ in the annotated grammar; this corresponds to an
equation $L(\nu) = L(\alpha)$. Since $\nu \to \alpha$, the probability of a string being generated from
$\nu$ is the same as its probability of being generated from $\alpha$. Thus, $P_\nu(\sigma) = P_\alpha(\sigma)$,
for all $\sigma \in L(\nu) = L(\alpha)$. Thus, the negentropy of $L(\nu)$, which we can write $\bar{H}(\nu)$, is the
same as the negentropy of $L(\alpha)$, which we can write $\bar{H}(\alpha)$. That is,

$$\bar{H}(\nu) = \bar{H}(\alpha).$$

It can be seen that corresponding to the BNF productions $\nu_i \to \alpha_i$ in $P$ there is a set of
simultaneous equations

$$\begin{aligned}
\bar{H}(\nu_0) &= \bar{H}(\alpha_0) \\
\bar{H}(\nu_1) &= \bar{H}(\alpha_1) \\
&\ \ \vdots \\
\bar{H}(\nu_k) &= \bar{H}(\alpha_k)
\end{aligned}$$

that can be solved to yield the negentropy of the language predicted by the grammar.
In particular,

$$\bar{H}(\Sigma) = \bar{H}(G) = \bar{H}(\nu_0).$$

We have already made use of this technique in applying the recursive definitions of $A_p^+$
and $A_p^*$ to solve for their negentropy. In summary, the methods developed previously
for computing the negentropies of regular languages can be extended in the obvious
way to context-free languages.
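
The simultaneous negentropy equations can be solved numerically. The sketch below uses fixed-point iteration on a dictionary encoding of an annotated BNF grammar; the encoding, the function names, the iteration limits, and the example grammar are assumptions of this illustration. It presumes an unambiguous grammar whose derivations terminate with probability 1, so that the iteration converges.

```python
import math

def negentropy(probs):
    """Sum of p * lg p over a discrete distribution."""
    return sum(p * math.log2(p) for p in probs if p > 0)

def grammar_negentropy(grammar, start, max_iterations=100000, tol=1e-13):
    """Solve H(v) = H(p_1..p_n) + sum_i p_i * sum of H(symbol) over alpha_i, by iteration.

    grammar maps each nonterminal to a list of (probability, right-hand side);
    any symbol that is not a key of the dictionary is a terminal with negentropy 0.
    """
    h = {nt: 0.0 for nt in grammar}
    for _ in range(max_iterations):
        new_h = {}
        for nt, alternatives in grammar.items():
            choice = negentropy([p for p, _ in alternatives])
            body = sum(p * sum(h.get(s, 0.0) for s in rhs) for p, rhs in alternatives)
            new_h[nt] = choice + body
        converged = max(abs(new_h[nt] - h[nt]) for nt in grammar) < tol
        h = new_h
        if converged:
            break
    return h[start]

# Hypothetical grammar: S -> 0.6 (empty) (+) 0.4 a S b
balanced = {"S": [(0.6, []), (0.4, ["a", "S", "b"])]}
print(grammar_negentropy(balanced, "S"))
```

For this single recursive production the equation can also be solved by hand, giving the negentropy of the choice distribution divided by the stopping probability, which is the value the iteration approaches.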

4.  Determining  the  Production  Probabilities 
4.1  Measurable  Properties 

To  compute  any  specific  entropies  we  need  to  know  the  probabilities  of  applying  the 
productions  in  the  appropriate  grammar  for  the  language.  These  can  be  obtained  by 
determining  measurable  parameters  whose  values  are  implied  by  the  production  pro¬ 
babilities.  That  is,  the  measurable  properties  are  a  function  of  the  production  proba¬ 
bilities.  The  production  probabilities  can  then  be  determined  by  (analytically  or 
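numerically) inverting this function.

What measurable properties should we use? One of the simplest is the density of
occurrence of a token. Let $\mathrm{Occ}_\tau(\sigma)$ be the number of occurrences of the symbol $\tau$ in
the string $\sigma$. This is formally defined:

$$\begin{aligned}
\mathrm{Occ}_\tau(\varepsilon) &= 0, \\
\mathrm{Occ}_\tau(\tau) &= 1, \\
\mathrm{Occ}_\tau(\tau') &= 0, \quad \text{for } \tau \neq \tau', \\
\mathrm{Occ}_\tau(\sigma) &= \mathrm{Occ}_\tau(\alpha) + \mathrm{Occ}_\tau(\beta), \quad \text{for } \sigma = \alpha\beta.
\end{aligned}$$

Then $\Delta_\tau(G)$, the density of occurrence of $\tau$ in the language generated by $G$, is

$$\Delta_\tau(G) = \frac{\sum_{\sigma \in L(G)} P_G(\sigma)\,\mathrm{Occ}_\tau(\sigma)}{\sum_{\sigma \in L(G)} P_G(\sigma)\,|\sigma|},$$

where $P_G(\sigma)$ is the predicted probability of generation of $\sigma$ and $|\sigma|$ is the length of $\sigma$. If
$G$ predicts $L(G)$, then $\Delta_\tau(G)$ will be the observed density of occurrence of $\tau$ in the
language generated by $G$.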


The formula for $\Delta_\tau(G)$ suggests two useful properties of a grammar: the average
length of the strings it generates and the average number of occurrences of a token in
a string. We let $\Lambda(G)$ be the predicted average length of the strings generated by $G$:

$$\Lambda(G) = \sum_{\sigma \in L(G)} P_G(\sigma)\,|\sigma|.$$

We let $\Phi_\tau(G)$ be the predicted frequency of occurrence of $\tau$ in the strings generated by
$G$:

$$\Phi_\tau(G) = \sum_{\sigma \in L(G)} P_G(\sigma)\,\mathrm{Occ}_\tau(\sigma).$$

It then follows that

$$\Delta_\tau(G) = \Phi_\tau(G) / \Lambda(G).$$

The goal then is to find ways to compute $\Phi_\tau(G)$ and $\Lambda(G)$ from $G$. This will permit us to
calculate predicted values of $\Delta_\tau$, which can be compared with actual measurements.
Therefore, the next two sections present means for calculating $\Lambda(G)$ and $\Phi_\tau(G)$.
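
As an illustration of these definitions, the following sketch evaluates $\Lambda$, $\Phi_\tau$ and $\Delta_\tau$
directly from their defining sums for a small finite language. The distribution `toy` and
the function names are hypothetical, and each character of a string is assumed to be one
token; the code only restates the sums above.

```python
def average_length(dist):
    """Lambda(G): expected string length; dist maps string -> probability."""
    return sum(prob * len(s) for s, prob in dist.items())


def token_frequency(dist, token):
    """Phi_tau(G): expected number of occurrences of token per string."""
    return sum(prob * s.count(token) for s, prob in dist.items())


def token_density(dist, token):
    """Delta_tau(G) = Phi_tau(G) / Lambda(G)."""
    return token_frequency(dist, token) / average_length(dist)


# Hypothetical four-string language; the probabilities sum to 1.
toy = {"": 0.5, "a": 0.25, "ab": 0.125, "abb": 0.125}

lam = average_length(toy)          # 0.25*1 + 0.125*2 + 0.125*3
phi_b = token_frequency(toy, "b")  # 0.125*1 + 0.125*2
delta_b = token_density(toy, "b")  # phi_b / lam
```

Because every token of a string is counted exactly once, the densities of all tokens sum
to one, which is a convenient sanity check on measured data.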


4.2  Average  String  Length 

We  begin  again  with  regular  grammars. 

Theorem:  The  average  lengths  of  the  empty  and  single  token  grammars  are  defined: 

$$\Lambda(\varepsilon) = 0 \text{ tokens},$$

$$\Lambda(\tau) = 1 \text{ token}.$$

Proof:  Obvious. 

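For the remaining derivations we need some notation. Suppose that

$$\begin{aligned}
L(A) &= \{\alpha_1, \alpha_2, \ldots\}, \\
L(B) &= \{\beta_1, \beta_2, \ldots\}, \\
a_i &= P_A(\alpha_i), \\
b_j &= P_B(\beta_j).
\end{aligned}$$

Then it follows that

$$\Lambda(A) = \sum_i a_i |\alpha_i|,$$

$$\Lambda(B) = \sum_j b_j |\beta_j|.$$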

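Theorem: The average length of the catenation of two grammars is the sum of their
average lengths:

$$\Lambda(AB) = \Lambda(A) + \Lambda(B).$$

Proof: Note that if $\sigma \in L(AB)$ then $\sigma = \alpha_i\beta_j$ for some $\alpha_i \in L(A)$, $\beta_j \in L(B)$. Assuming as
usual that the choices from $A$ and $B$ are independent,

$$P_{AB}(\sigma) = P_A(\alpha_i)P_B(\beta_j) = a_i b_j.$$

We now derive the average length:

$$\begin{aligned}
\Lambda(AB) &= \sum_i \sum_j a_i b_j\,|\alpha_i\beta_j| \\
&= \sum_i \sum_j a_i b_j\,(|\alpha_i| + |\beta_j|) \\
&= \sum_i a_i \Big[ \sum_j b_j |\alpha_i| + \sum_j b_j |\beta_j| \Big] \\
&= \sum_i a_i \big[\, |\alpha_i| + \Lambda(B) \big] \\
&= \sum_i a_i |\alpha_i| + \sum_i a_i \Lambda(B) \\
&= \Lambda(A) + \Lambda(B).
\end{aligned}$$

Q.E.D.

Corollary: $\Lambda(A^n) = n\Lambda(A)$.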

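Theorem: The average length of an alternation is the average of the average lengths
of the alternands:

$$\Lambda(pA \oplus \bar{p}B) = p\Lambda(A) + \bar{p}\Lambda(B).$$

Proof: Let $G = pA \oplus \bar{p}B$. Recall that if $\sigma \in L(G)$ then either $\sigma \in L(A)$ or $\sigma \in L(B)$, and that
the choice from $A$ is made with probability $p$. Therefore

$$P_G(\sigma) = pP_A(\sigma), \quad \text{if } \sigma \in L(A),$$

$$P_G(\sigma) = \bar{p}P_B(\sigma), \quad \text{if } \sigma \in L(B).$$

Then we derive:

$$\begin{aligned}
\Lambda(G) &= \sum_{\sigma \in L(G)} P_G(\sigma)\,|\sigma| \\
&= \sum_i P_G(\alpha_i)\,|\alpha_i| + \sum_j P_G(\beta_j)\,|\beta_j| \\
&= \sum_i p\,a_i |\alpha_i| + \sum_j \bar{p}\,b_j |\beta_j| \\
&= p \sum_i a_i |\alpha_i| + \bar{p} \sum_j b_j |\beta_j| \\
&= p\Lambda(A) + \bar{p}\Lambda(B).
\end{aligned}$$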

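Q.E.D.

Theorem: $\Lambda(A^{p+}) = \Lambda(A)/p$.

Proof: We appeal to the infinite expansion:

$$A^{p+} = pA \oplus \bar{p}pA^2 \oplus \bar{p}^2pA^3 \oplus \cdots$$

Applying $\Lambda$ to both sides:

$$\begin{aligned}
\Lambda(A^{p+}) &= p\Lambda(A) + \bar{p}p\Lambda(A^2) + \bar{p}^2p\Lambda(A^3) + \cdots \\
&= p[1 + 2\bar{p} + 3\bar{p}^2 + \cdots]\Lambda(A) \\
&= p[1/p^2]\Lambda(A) \\
&= \Lambda(A)/p.
\end{aligned}$$

Q.E.D. Alternately, we can appeal to the recursive definition:

$$\begin{aligned}
\Lambda(A^{p+}) &= p\Lambda(A) + \bar{p}\Lambda(AA^{p+}) \\
&= p\Lambda(A) + \bar{p}\Lambda(A) + \bar{p}\Lambda(A^{p+}).
\end{aligned}$$

Grouping like terms gives

$$(1-\bar{p})\Lambda(A^{p+}) = (p+\bar{p})\Lambda(A),$$

which leads directly to the result. Q.E.D.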

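Theorem: $\Lambda(A^{p*}) = (\bar{p}/p)\Lambda(A)$.

Proof: We apply $\Lambda$ to the infinite expansion of $A^{p*}$:

$$\begin{aligned}
\Lambda(A^{p*}) &= p\Lambda(\varepsilon) + \bar{p}p\Lambda(A) + \bar{p}^2p\Lambda(A^2) + \bar{p}^3p\Lambda(A^3) + \cdots \\
&= p \cdot 0 + \bar{p}p\Lambda(A) + \bar{p}^2p\,2\Lambda(A) + \bar{p}^3p\,3\Lambda(A) + \cdots \\
&= \bar{p}p(1 + 2\bar{p} + 3\bar{p}^2 + \cdots)\Lambda(A) \\
&= \bar{p}p(1/p^2)\Lambda(A) \\
&= (\bar{p}/p)\Lambda(A).
\end{aligned}$$

Q.E.D.

Alternately, we can apply $\Lambda$ to the recursive definition:

$$\begin{aligned}
\Lambda(A^{p*}) &= p\Lambda(\varepsilon) + \bar{p}\Lambda(AA^{p*}) \\
&= p \cdot 0 + \bar{p}\Lambda(A) + \bar{p}\Lambda(A^{p*}).
\end{aligned}$$

Solving for $\Lambda(A^{p*})$ we get

$$\Lambda(A^{p*}) = (\bar{p}/p)\Lambda(A).$$

Q.E.D.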
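
The rules derived so far can be collected into a small recursive evaluator. This is a
minimal sketch, assuming a hypothetical tuple encoding of annotated regular grammars
(`eps`, `tok`, `cat`, `alt`, `plus`, `star`); the annotation `p` is read as in the
theorems above, so that $\bar{p} = 1 - p$ is the probability of appending another copy of $A$.

```python
def avg_len(g):
    """Average string length of an annotated regular grammar (assumed encoding)."""
    kind = g[0]
    if kind == "eps":                      # empty string
        return 0.0
    if kind == "tok":                      # single token
        return 1.0
    if kind == "cat":                      # catenation A B
        return avg_len(g[1]) + avg_len(g[2])
    if kind == "alt":                      # pA (+) (1-p)B
        _, p, a, b = g
        return p * avg_len(a) + (1.0 - p) * avg_len(b)
    if kind == "plus":                     # A^{p+}
        _, p, a = g
        return avg_len(a) / p
    if kind == "star":                     # A^{p*}
        _, p, a = g
        return (1.0 - p) / p * avg_len(a)
    raise ValueError("unknown grammar node: %r" % (kind,))


def plus_series(p, length_a, terms=2000):
    """Truncated infinite expansion: sum_k p * (1-p)^(k-1) * k * Lambda(A)."""
    return sum(p * (1.0 - p) ** (k - 1) * k * length_a for k in range(1, terms + 1))


bit = ("alt", 0.25, ("tok", "0"), ("tok", "1"))
star_len = avg_len(("star", 0.5, bit))
plus_len = avg_len(("plus", 0.5, bit))
```

With $p = 1/2$ the starred grammar averages one token and the crossed grammar two,
and the truncated series agrees with the closed form $\Lambda(A)/p$ to within rounding.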

We  consider  some  simple  examples  based  on  free  languages. 

Theorem: Consider the free language on $n$ symbols generated by:

$$F_n = (q_1\tau_1 \oplus q_2\tau_2 \oplus \cdots \oplus q_n\tau_n)^{\bar{p}*}.$$

The average length of the strings of this language is

$$\lambda = \Lambda(F_n) = p/\bar{p} \text{ tokens}.$$

Proof: Since $\Lambda(\tau_i) = 1$ and $q_1 + \cdots + q_n = 1$,

$$\begin{aligned}
\Lambda(F_n) &= (p/\bar{p})[q_1\Lambda(\tau_1) + \cdots + q_n\Lambda(\tau_n)] \\
&= (p/\bar{p})[q_1 + \cdots + q_n] \\
&= p/\bar{p}.
\end{aligned}$$

Q.E.D.

Notice that the average length of a free language is independent of the number of
symbols in the alphabet. This is to be expected.

Corollary: The average length of a free language with continuation probability one half is 1
token.

Proof: Apply the previous theorem with $p = \bar{p} = 1/2$. Q.E.D.


The formulas for computing average string length are summarized in Table 3.

TABLE 3. Average String Length for Regular Grammars

$\Lambda(\varepsilon)$              =  0 tokens
$\Lambda(\tau)$                     =  1 token
$\Lambda(AB)$                       =  $\Lambda(A) + \Lambda(B)$
$\Lambda(pA \oplus \bar{p}B)$       =  $p\Lambda(A) + \bar{p}\Lambda(B)$
$\Lambda(A^{p+})$                   =  $\Lambda(A)/p$
$\Lambda(A^{p*})$                   =  $(\bar{p}/p)\Lambda(A)$

4.3 Average Token Frequency

The formulas for computing average token frequency are almost identical to those for
average length. For this reason the proofs are omitted and the results are shown in
Table 4.


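TABLE 4. Average Token Frequency for Regular Grammars

$\Phi_\tau(\varepsilon)$            =  0
$\Phi_\tau(\tau)$                   =  1
$\Phi_\tau(\tau')$                  =  0, for $\tau \neq \tau'$
$\Phi_\tau(AB)$                     =  $\Phi_\tau(A) + \Phi_\tau(B)$
$\Phi_\tau(pA \oplus \bar{p}B)$     =  $p\Phi_\tau(A) + \bar{p}\Phi_\tau(B)$
$\Phi_\tau(A^{p+})$                 =  $\Phi_\tau(A)/p$
$\Phi_\tau(A^{p*})$                 =  $(\bar{p}/p)\Phi_\tau(A)$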

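Theorem: Consider the free language on $n$ symbols:

$$F_n = (q_1\tau_1 \oplus q_2\tau_2 \oplus \cdots \oplus q_n\tau_n)^{\bar{p}*}.$$

The frequency of occurrence of token $\tau_i$ is

$$\varphi_i = \Phi_i(F_n) = q_i\,p/\bar{p}.$$

Proof: We derive as follows, abbreviating $\Phi_{\tau_i}$ by $\Phi_i$:

$$\begin{aligned}
\Phi_i(F_n) &= (p/\bar{p})\,\Phi_i[q_1\tau_1 \oplus \cdots \oplus q_n\tau_n] \\
&= (p/\bar{p})[q_1\Phi_i(\tau_1) + \cdots + q_i\Phi_i(\tau_i) + \cdots + q_n\Phi_i(\tau_n)] \\
&= (p/\bar{p})[q_1 \cdot 0 + \cdots + q_i \cdot 1 + \cdots + q_n \cdot 0] \\
&= (p/\bar{p})\,q_i.
\end{aligned}$$

Q.E.D.

Corollary: In the free language of the previous theorem, the density of occurrence of
symbol $\tau_i$ is $q_i$. We denote this measurable property $\delta_i$.

Proof: Since $\Delta_i(F_n) = \Phi_i(F_n)/\Lambda(F_n)$, we have

$$\delta_i = \varphi_i/\lambda = \frac{q_i\,p/\bar{p}}{p/\bar{p}} = q_i.$$

Q.E.D.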

The  following  corollary  shows  that  for  the  case  of  a  free  language  it  is  easy  to  com¬ 
pute  the  production  probabilities  from  measurable  properties  of  strings  in  the 
language. 

Corollary: If we measure the properties $\delta_1, \delta_2, \ldots, \delta_n$ and $\lambda$ of a free language on $n$
symbols, then we can compute the probabilities $q_1, q_2, \ldots, q_n$ and $p$ from them by the
formulas:

$$q_i = \delta_i,$$
$$p = \lambda/(\lambda+1).$$

Proof: The formulas for $q_i$ are obvious. For $p$ we know that

$$\lambda = p/\bar{p} = p/(1-p).$$

Therefore $\lambda - p\lambda = p$, so $\lambda = p + p\lambda$. Thus $\lambda = p(\lambda+1)$, so $p = \lambda/(\lambda+1)$.

Q.E.D. 
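
The corollary can be turned into a simple estimator. The sketch below is illustrative
only: the sample list is made up, strings are assumed to be Python strings with one
character per token, and `estimate_free_language` is a hypothetical name. It measures
$\lambda$ and the $\delta_i$ from a sample and then applies $q_i = \delta_i$ and $p = \lambda/(\lambda+1)$.

```python
from collections import Counter


def estimate_free_language(samples):
    """Estimate (q, p, lam) of a free language from observed strings.

    q   : dict token -> estimated production probability (= observed density)
    p   : estimated continuation probability, lam / (lam + 1)
    lam : observed average string length
    """
    total_tokens = sum(len(s) for s in samples)
    if not samples or total_tokens == 0:
        raise ValueError("need at least one nonempty sample string")
    lam = total_tokens / len(samples)
    counts = Counter(tok for s in samples for tok in s)
    q = {tok: c / total_tokens for tok, c in counts.items()}
    p = lam / (lam + 1.0)
    return q, p, lam


# Made-up observations: 6 strings, 8 tokens in total (6 'a', 2 'b').
samples = ["", "a", "ab", "", "ba", "aaa"]
q_hat, p_hat, lam_hat = estimate_free_language(samples)
```

For a real language the quality of these estimates depends on the sample size and on how
well a free language actually predicts the data; neither is addressed by this sketch.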

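Corollary: The negentropy of a free language exhibiting occurrence densities $\delta_1, \ldots, \delta_n$
and average string length $\lambda$ is:

$$H_n = H(\lambda, \lambda+1) + \lambda H(\delta_1, \ldots, \delta_n).$$

Proof: To derive this result we take the formula for the negentropy of a free language,

$$H_n = H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p},$$

and substitute the values for $q_i$ and $p$ derived in the previous corollary. To do this,
note that $\bar{p} = 1 - p = 1/(\lambda+1)$ and $p/\bar{p} = \lambda$.

Then we have

$$\begin{aligned}
H_n &= \lambda \lg\frac{\lambda}{\lambda+1} - \lg(\lambda+1) + \lambda H(\delta_1, \ldots, \delta_n) \\
&= \lambda \lg \lambda - \lambda \lg(\lambda+1) - \lg(\lambda+1) + \lambda H(\delta_1, \ldots, \delta_n) \\
&= \lambda \lg \lambda - (\lambda+1)\lg(\lambda+1) + \lambda H(\delta_1, \ldots, \delta_n) \\
&= H(\lambda, \lambda+1) + \lambda H(\delta_1, \ldots, \delta_n).
\end{aligned}$$

Q.E.D.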

Thus  we  have  the  negentropy  (and  hence  entropy)  of  a  language  expressed  entirely 
in  measurable  parameters. 

4.4  Average  Information  per  Symbol 

Recall that the entropy of a language measures the average information borne by each
string in the language. That is,

$$H(\mathcal{L}) = \sum_{\sigma \in \mathcal{L}} P_{\mathcal{L}}(\sigma)\,I_{\mathcal{L}}(\sigma).$$

However, each of the strings of the language is composed of a number of terminal symbols
(tokens). Therefore, it is interesting to compute the average information borne by
each symbol (token) in the strings in the language. We call this the information
density of the language.

Information density is easy to compute: it is simply the average information borne
by the strings of the language divided by the average length of those strings:

$$\eta(\mathcal{L}) = H(\mathcal{L}) / \Lambda(\mathcal{L}),$$

where we have used $\eta(\mathcal{L})$ for the average information borne per symbol in the strings of $\mathcal{L}$.
The units of information density are bits/token.

If the grammar $G$ predicts the language $\mathcal{L}$, then

$$\eta(\mathcal{L}) = \eta(G) = H(G)/\Lambda(G).$$

We use this result to compute the information density for several languages.

Theorem: Let $F_n$ be the free language with continuation probability $p$ on $n$ symbols
with probabilities $q_i$. Then, the information density of $F_n$ is:

$$\eta(F_n) = H(p,\bar{p})/p + H(q_1, \ldots, q_n) \text{ bits/token}.$$

Proof: Take the formula for the negentropy of $F_n$ and negate it to get the entropy formula:

$$H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}.$$

Divide this by the average length $\Lambda(F_n) = p/\bar{p}$ to get the information density:

$$\begin{aligned}
\eta(F_n) &= H(F_n) / \Lambda(F_n) \\
&= \frac{[H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}}{p/\bar{p}} \\
&= H(p,\bar{p})/p + H(q_1, \ldots, q_n).
\end{aligned}$$

Q.E.D.

Corollary: The information density of the free language on one symbol with continuation
probability $p$ is:

$$\eta(F_1) = H(p,\bar{p})/p \text{ bits/token}.$$

Corollary: The information density of the free language with continuation probability
$p$ on $n$ equally likely symbols is:

$$\eta = H(p,\bar{p})/p + \lg n \text{ bits/token}.$$

Proof: We simply use $q_i = 1/n$:

$$\begin{aligned}
\eta &= H(p,\bar{p})/p + H(q_1, \ldots, q_n) \\
&= H(p,\bar{p})/p + H(1/n, \ldots, 1/n) \\
&= H(p,\bar{p})/p + (1/n)\lg n + \cdots + (1/n)\lg n \\
&= H(p,\bar{p})/p + \lg n \text{ bits/token}.
\end{aligned}$$

Q.E.D.

Corollary: The information density of the free language with continuation probability
one half and $2^k$ equally likely symbols is $k+2$ bits/token.

Proof: Let $n = 2^k$ and $p = \bar{p} = 1/2$ in the previous formula:

$$\begin{aligned}
\eta &= H(\tfrac12,\tfrac12)/(\tfrac12) + \lg 2^k \\
&= 2H(\tfrac12,\tfrac12) + k \\
&= 2(\tfrac12\lg 2 + \tfrac12\lg 2) + k \\
&= 2(\tfrac12 + \tfrac12) + k.
\end{aligned}$$

Q.E.D.
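
A quick numerical check of the last few results, written as a sketch: `entropy` here is
the ordinary positive Shannon entropy in bits (the sign convention is an assumption of
this example), and `free_language_density` is a hypothetical helper that evaluates
$\eta(F_n) = H(p,\bar{p})/p + H(q_1, \ldots, q_n)$ with $p$ the continuation probability.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in probs if x > 0)


def free_language_density(p, q):
    """Information density (bits/token) of a free language.

    p : continuation probability, 0 < p < 1
    q : list of token probabilities summing to 1
    """
    return entropy([p, 1.0 - p]) / p + entropy(q)


k = 3
n = 2 ** k
eta_uniform = free_language_density(0.5, [1.0 / n] * n)   # expected k + 2
eta_single = free_language_density(0.5, [1.0])            # one-symbol case
```

For $k = 3$ the value is $k + 2 = 5$ bits/token, and for a single symbol with
continuation probability one half it is $2$ bits/token.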

The difference between the entropy of a language and its average information density
can be understood by looking at some simple examples. In particular we will consider
the languages $N_k$ of all nonempty strings on $k$ symbols. Thus, $N_k$ is just the free
language $F_k$ without the empty string. Conversely,

$$F_k = N_k \oplus \varepsilon.$$

Theorem: Let $N_k = (q_1\tau_1 \oplus \cdots \oplus q_k\tau_k)^{p+}$. Then,

$$H(N_k) = [H(p,\bar{p}) + H(q_1, \ldots, q_k)]/p \text{ bits}$$

$$\Lambda(N_k) = 1/p \text{ tokens}$$

$$\eta(N_k) = H(p,\bar{p}) + H(q_1, \ldots, q_k) \text{ bits/token}.$$

Proof: Simply apply the previous formulas. Q.E.D.

To understand the implications of this result we consider an especially simple case,
$N_1$, the language of all nonempty strings on one symbol $\tau$:

$$N_1 = \tau^{p+}.$$

Hence $L(N_1) = \{\tau, \tau\tau, \tau\tau\tau, \ldots\}$.

Corollary: If the continuation probability of $N_1$ is one half, then

$$H(N_1) = 2 \text{ bits}$$
$$\Lambda(N_1) = 2 \text{ tokens}$$
$$\eta(N_1) = 1 \text{ bit/token}$$

Proof: Substitute $p = 1/2$ in the previous formulas and recall that

$$H(\tfrac12,\tfrac12) = \tfrac12\lg 2 + \tfrac12\lg 2 = \lg 2 = 1 \text{ bit}.$$

Q.E.D.

This  result  is  easy  to  interpret,  since  each  succeeding  token  indicates  that  the 
choice  has  been  made  to  continue  the  string.  Since  the  probability  of  the  choice  is  one 
half,  each  token  conveys  one  bit  of  information. 

Next we consider $N_2$, which can be considered the language of nonnull strings of
binary digits:

$$N_2 = (0 \oplus 1)^{+}.$$

The following theorem addresses the information density of this language.

Theorem: If $N_2$ is the language of nonnull binary strings:

$$N_2 = (q0 \oplus \bar{q}1)^{p+},$$

then

$$H(N_2) = [H(p,\bar{p}) + H(q,\bar{q})]/p \text{ bits}$$

$$\Lambda(N_2) = 1/p \text{ tokens}$$

$$\eta(N_2) = H(p,\bar{p}) + H(q,\bar{q}) \text{ bits/token}.$$

Proof: Apply the previous formulas. Q.E.D.

Corollary: Suppose the binary digits 0 and 1 are equally likely. Then the information
density of the language of nonnull binary strings is

$$\eta(N_2) = 1 + H(p,\bar{p}) \text{ bits/token}.$$

Proof: Apply the previous theorem with $q = \bar{q} = 1/2$. Q.E.D.

Note that since $H(p,\bar{p}) > 0$, we know

$$\eta(N_2) > 1 \text{ bit/token}.$$

Since in this case a token is a binary digit, we have the somewhat surprising result that
the information density of the language of binary strings is greater than one bit per
binary digit. How can this be? The extra $H(p,\bar{p})$ bits of information per binary digit
come from the fact that the binary strings are variable length. We previously saw that
in $N_1$ the continuation of the string conveys $H(p,\bar{p})$ bits of information.

The source of the extra information can be made clearer by considering a language
in which it's absent, the language of all $n$ digit binary strings:

$$W_n = (q0 \oplus \bar{q}1)^n.$$

Let $q = \bar{q} = 1/2$. Then

$$H(W_n) = nH(\tfrac12,\tfrac12) = n \text{ bits}$$
$$\Lambda(W_n) = n\Lambda(q0 \oplus \bar{q}1) = n \text{ tokens}$$
$$\eta(W_n) = n/n = 1 \text{ bit/token}.$$

Thus the information density of $W_n$ is one bit per binary digit, as expected. Since all
the strings of $W_n$ are the same length, no information is conveyed by the continuation
of the string.

Consider again the language of nonempty base $k$ strings:

$$N_k = (q_1\tau_1 \oplus \cdots \oplus q_k\tau_k)^{p+}.$$

We have seen that its information density is

$$\eta(N_k) = H(p,\bar{p}) + H(q_1, \ldots, q_k) \text{ bits/token}.$$

Now we can see that this result is intuitive, since each additional symbol in a string
conveys two pieces of information: the decision to continue, $H(p,\bar{p})$ bits, and the symbol
chosen for the continuation, $H(q_1, \ldots, q_k)$ bits.

5. Information Theoretic Properties of Grammars

In this section we consider two information theoretic quantities that are properties
of grammars, as opposed to properties of the languages predicted by those grammars.
These properties are the average length of a derivation from a grammar and the average
amount of information consumed by a production in a grammar.

5.1 Average Derivation Length

First we consider the average length of a derivation from a context-free grammar $G$,
that is, the average number of productions that must be applied in going from the goal
symbol $\nu_0$ to a terminal string $\sigma$. We apply similar techniques to those previously
introduced, transforming the set of productions into an equivalent set of simultaneous
equations.

We will let $D(\alpha)$ represent the average length of a derivation from a string $\alpha$ of
terminals and nonterminals. Now consider an arbitrary BNF production

$$\nu \to p_1\alpha_1 \oplus \cdots \oplus p_n\alpha_n$$

in $P$, the set of productions in $G$. We want to compute $D(\nu)$, the average length of a
derivation from $\nu$. In deriving a terminal string from a string containing $\nu$, we apply
the production $\nu \to \alpha_i$ with probability $p_i$. If we apply $\nu \to \alpha_1$, then the length of the
derivation is one plus the length of the derivation from $\alpha_1$. The same holds for each
$\nu \to \alpha_i$. Thus the average length of a derivation from $\nu$ is

$$\begin{aligned}
D(\nu) &= p_1[1 + D(\alpha_1)] + \cdots + p_n[1 + D(\alpha_n)] \\
&= (p_1 + \cdots + p_n) + p_1D(\alpha_1) + \cdots + p_nD(\alpha_n) \\
&= 1 + p_1D(\alpha_1) + \cdots + p_nD(\alpha_n).
\end{aligned}$$

This result is intuitive: the constant 1 accounts for the fact that we must apply some
production to eliminate the $\nu$; the remainder of the terms are the weighted average of
the average derivation lengths of the alternands.

Next we derive a number of rules for simplifying the right-hand sides of these equations.
If $\alpha$ is the empty string, then no productions can be applied to it, so

$$D(\varepsilon) = 0.$$

If $\alpha$ begins with a terminal symbol $\tau$, $\alpha = \tau\beta$, then, since no productions can be applied
to a terminal, we have

$$D(\tau\beta) = D(\beta).$$

If $\alpha$ begins with a nonterminal symbol $\nu$, $\alpha = \nu\beta$, then, since both $\nu$ and $\beta$ must be
reduced to terminal strings, the average length of a derivation from $\alpha$ must be the sum
of the average lengths of derivations from $\nu$ and $\beta$:

$$D(\nu\beta) = D(\nu) + D(\beta).$$


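We summarize the formulas for computing average derivation length in Table 5.

TABLE 5. Average Derivation Length for Grammars

$\nu \to p_1\alpha_1 \oplus \cdots \oplus p_n\alpha_n$   $\Rightarrow$   $D(\nu) = 1 + p_1D(\alpha_1) + \cdots + p_nD(\alpha_n)$
$D(\varepsilon)$        =  0
$D(\tau)$               =  0
$D(\alpha\beta)$        =  $D(\alpha) + D(\beta)$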

Theorem: Let $F_n$ be the following annotated grammar for the free language on $n$ symbols:

$$F_n \to \bar{p}\varepsilon \oplus pAF_n$$
$$A \to q_1\tau_1 \oplus \cdots \oplus q_n\tau_n.$$

Then the average derivation length of $F_n$ is

$$D(F_n) = (p + 1)/\bar{p} \text{ productions}.$$

Proof: We transform the productions into the equations:

$$D(F_n) = 1 + \bar{p}D(\varepsilon) + pD(AF_n)$$

$$D(A) = 1 + q_1D(\tau_1) + \cdots + q_nD(\tau_n).$$

Thus, $D(A) = 1$. Simplifying we derive

$$\begin{aligned}
D(F_n) &= 1 + p[D(A) + D(F_n)] \\
&= 1 + p + pD(F_n).
\end{aligned}$$

Solving for $D(F_n)$ we have

$$D(F_n) = (p + 1)/\bar{p} \text{ productions}.$$

Q.E.D.
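
The same simultaneous equations can be solved numerically for an arbitrary annotated
grammar. The following is a minimal sketch under stated assumptions: the dictionary
encoding of the grammar and the names `derivation_lengths` and `free_language_grammar`
are hypothetical, and plain fixed-point iteration is used, which converges only when the
average derivation length is finite.

```python
def derivation_lengths(grammar, iterations=200):
    """Iterate D(v) = 1 + sum_i p_i * D(alpha_i) starting from D = 0.

    grammar maps each nonterminal to a list of (probability, rhs) pairs,
    where rhs is a tuple of symbols. Symbols that are not keys of grammar
    are treated as terminals, for which D = 0.
    """
    d = {nt: 0.0 for nt in grammar}
    for _ in range(iterations):
        d = {
            nt: 1.0 + sum(p * sum(d.get(sym, 0.0) for sym in rhs)
                          for p, rhs in alts)
            for nt, alts in grammar.items()
        }
    return d


def free_language_grammar(p, q):
    """F -> (1-p) eps | p A F ;  A -> q_1 t1 | ... | q_n tn  (p = continuation)."""
    return {
        "F": [(1.0 - p, ()), (p, ("A", "F"))],
        "A": [(qi, ("t%d" % i,)) for i, qi in enumerate(q, start=1)],
    }


d_half = derivation_lengths(free_language_grammar(0.5, [0.25, 0.75]))
d_quarter = derivation_lengths(free_language_grammar(0.25, [0.5, 0.5]))
```

For continuation probability one half the iteration approaches 3 productions for `F`,
and for one quarter it approaches $(0.25 + 1)/0.75 = 5/3$, matching $(p+1)/\bar{p}$.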

As would be expected, the average derivation length is independent of the probabilities
$q_i$.

Corollary: The average derivation length of $F_n$ with continuation probability one half
is 3 productions.

Proof: Apply theorem with $p = \bar{p} = 1/2$. Q.E.D.

This result is also intuitive. We always must apply at least one production (for $F_n$).
If we choose to stop, with probability one half, we have applied one production. However,
if we choose to continue, with probability one half, then we must apply two more
productions (one for $A$, one for $F_n$), and repeat our choice. Thus we have:

$$\begin{aligned}
D &= \tfrac12 \cdot 1 + \tfrac12\big[2 + \tfrac12 \cdot 1 + \tfrac12(2 + \tfrac12 \cdot 1 + \cdots)\big] \\
&= \tfrac12 + 1 + (\tfrac12)^2 + \tfrac12 + (\tfrac12)^3 + \cdots
\end{aligned}$$

Regrouping gives

$$\begin{aligned}
D &= (1 + \tfrac12 + \cdots) + (\tfrac12 + (\tfrac12)^2 + (\tfrac12)^3 + \cdots) \\
&= 2 + 1 \text{ productions}.
\end{aligned}$$


5.2  Average  Information  Used  by  a  Production 

Consider  a  leftmost  derivation  of  a  terminal  string  from  a  grammar.  At  each  point 
in  the  derivation  a  nonterminal  must  be  replaced  by  a  string  according  to  the  produc¬ 
tions  of  the  grammar.  In  many  of  these  cases  there  will  be  a  choice  of  which  of  several 
alternate  productions  are  to  be  applied.  Such  a  choice  will  require  information  to  be 
supplied.  Considering  the  average  information  that  must  be  supplied  per  applied  pro¬ 
duction  gives  us  a  gauge  of  the  efficiency  with  which  a  grammar  transforms  informa¬ 
tion  into  terminal  strings. 

Since  in  an  unambiguous  context-free  grammar  there  is  a  unique  leftmost  deriva¬ 
tion  of  any  string  in  the  language  generated  by  the  grammar,  we  can  set  up  a  one-one 
correspondence between the leftmost derivations and the strings. Thus, for any
$\sigma \in L(G)$ there is a unique series of productions $\pi_1, \pi_2, \ldots, \pi_k$ that generates $\sigma$. As we
saw before, if production $\pi_i$ has an a priori probability $P(\pi_i)$ of being chosen, then the
probability that $G$ will generate $\sigma$ is

$$P_G(\sigma) = P(\pi_1)P(\pi_2) \cdots P(\pi_k).$$

Hence, the generation of a string by a grammar can be viewed as a series of choices
$\pi_1, \ldots, \pi_k$ having probabilities $P(\pi_1), \ldots, P(\pi_k)$ of being made.

Now we look at this formula a different way. Recall [Shannon, Hamming] that when a
previously undetermined situation with a priori probability $p$ is determined, the information
conveyed is $-\lg p$ bits. That is, information is conveyed by making choices.
Thus, the information conveyed by making choice $\pi_i$, with probability $P(\pi_i)$, is

$$I(\pi_i) = -\lg P(\pi_i) \text{ bits}.$$

Therefore, the information conveyed by $\sigma$ is just the total information conveyed by the
choices that lead to $\sigma$:

$$I_G(\sigma) = -\lg P_G(\sigma) = -\lg P(\pi_1) + \cdots + -\lg P(\pi_k) = I(\pi_1) + \cdots + I(\pi_k).$$

This information is used by the grammar in going from an undetermined nonterminal
symbol to a completely determined terminal string. That is, this information drives a
decrease in the entropy from $H(G)$ to 0 (since a terminal string has no entropy).

Now recall that

$$\begin{aligned}
H(G) &= -\sum_{\sigma \in L(G)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_{\sigma \in L(G)} P_G(\sigma)\,I_G(\sigma).
\end{aligned}$$

Thus,  the  entropy  of  a  language  is  the  average  information  conveyed  by  its  strings. 

A  grammar  with  higher  entropy  is  less  constrained,  more  disordered,  than  one  with 
lower  entropy,  so  on  the  average  it  takes  more  information  to  generate  a  particular 
string  from  it.  A  grammar  with  low  entropy  is  highly  constrained,  so  on  the  average  lit¬ 
tle  information  is  needed  (or  used)  in  generating  a  string;  there  are  fewer  choices  to  be 
made. 

We can now apply these results to determining the average information used per
production by a grammar. Since the information conveyed by a string is the same as
the information used in its derivation, the entropy of a language measures the average
information used by a grammar in generating a string of that language. If we also know
the average derivation length for that grammar, that is, the average number of productions
needed to generate a string, then we know the average amount $Q$ of information
consumed per production. Summarizing,

$$Q(G) = H(G)/D(G),$$

where $H(G) = H[L(G)]$.

Theorem: Consider the following grammar for the free language on $n$ symbols:

$$F_n \to \bar{p}\varepsilon \oplus pAF_n$$
$$A \to q_1\tau_1 \oplus \cdots \oplus q_n\tau_n.$$

This grammar uses on the average

$$Q(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/(p+1) \text{ bits/production}$$

to generate a string.

Proof: Recall that

$$H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p} \text{ bits}$$

$$D(F_n) = (p + 1)/\bar{p} \text{ productions}$$

and divide.

Q.E.D.

Corollary: $F_2$ with continuation probability one half uses on the average one bit of
information per production.

Proof: Set $p = \bar{p} = q_1 = q_2 = 1/2$. Then,

$$\begin{aligned}
Q(F_2) &= [H(\tfrac12,\tfrac12) + \tfrac12 H(\tfrac12,\tfrac12)]/(\tfrac12 + 1) \\
&= (1\tfrac12)/(1\tfrac12) \\
&= 1 \text{ bit}.
\end{aligned}$$

Q.E.D.

This  is  intuitive,  since  this  grammar  must  use  one  bit  on  each  production,  either  to 
decide  whether  or  not  to  continue,  or  to  decide  which  symbol  to  generate. 

Consider $F_1$, the above grammar restricted to generate the free language on one
symbol, with continuation probability one half. The above formula says the information
consumed per production is

$$\begin{aligned}
Q(F_1) &= [H(\tfrac12,\tfrac12) + \tfrac12 H(1)]/(\tfrac12 + 1) \\
&= H(\tfrac12,\tfrac12)/(1\tfrac12) \\
&= 2/3 \text{ bits/production}.
\end{aligned}$$

This might be surprising, since it seems that with each successive symbol of the string
exactly one bit of information is being used, namely, to decide whether or not to
continue. The source of the inefficiency can be found by looking at $F_1$:

$$F_1 \to \bar{p}\varepsilon \oplus pAF_1$$
$$A \to \tau$$

There is a redundant production $A \to \tau$ that uses no information; this decreases the
average information used per production. If we eliminate this redundancy, we get the
one-production grammar

$$F_1 \to \bar{p}\varepsilon \oplus p\tau F_1.$$

It has the same entropy as the previous grammar, but a shorter average derivation
length:

$$\begin{aligned}
D(F_1) &= 1 + pD(F_1) \\
&= 1/\bar{p}.
\end{aligned}$$

For $p = 1/2$ its average derivation length is 2 productions, as opposed to 3 for the version
with the redundant rule. The information used per production is then

$$\begin{aligned}
Q(F_1) &= H(F_1)/D(F_1) \\
&= \frac{H(p,\bar{p})/\bar{p}}{1/\bar{p}} \\
&= H(p,\bar{p}) \text{ bits/production}.
\end{aligned}$$

In the case $p = 1/2$ this grammar uses one bit per production, as would be expected.
Thus we have a way of comparing the efficiencies with which grammars use information
and of determining whether grammars have useless productions.
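
The comparison between the two grammars for the one-symbol free language can be
reproduced numerically. In this sketch `entropy` is the ordinary positive Shannon
entropy in bits, `p` is the continuation probability, and the function names are
hypothetical labels for the two grammars discussed above.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in probs if x > 0)


def q_with_redundant_rule(p):
    """F1 -> (1-p) eps | p A F1 ;  A -> tau."""
    h = entropy([p, 1.0 - p]) / (1.0 - p)   # entropy of the language, in bits
    d = (p + 1.0) / (1.0 - p)               # average derivation length
    return h / d


def q_one_production(p):
    """F1 -> (1-p) eps | p tau F1."""
    h = entropy([p, 1.0 - p]) / (1.0 - p)   # same language, same entropy
    d = 1.0 / (1.0 - p)                     # shorter average derivation
    return h / d


q_redundant = q_with_redundant_rule(0.5)
q_compact = q_one_production(0.5)
```

At $p = 1/2$ the grammar with the redundant rule uses $2/3$ bit per production and the
one-production grammar uses 1 bit; in general the ratio of the two values is $1/(p+1)$.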

6. Example Applications

In this example we illustrate the previously described techniques by their application
to a nontrivial language. Table 6 shows the context-free grammar for lambda-calculus
expressions; we have added unknown production probabilities. We now apply $H$ to
these productions and solve for the negentropy.

TABLE 6. Annotated Grammar for Lambda Calculus

$E = q_1 I \;\oplus\; q_2\,\text{'('}\,\text{'}\lambda\text{'}\,I\,E\,\text{')'} \;\oplus\; q_3\,\text{'('}\,E\,E\,\text{')'}$

$I = (p_1 a \oplus p_2 b \oplus \cdots \oplus p_{36} 9)^{p+}$

where $q_1+q_2+q_3 = 1$, and $p_1+p_2+\cdots+p_{36} = 1$

Thus,

$$H(E) = H(q_1,q_2,q_3) + q_1H(I) + q_2H\{\text{'('}\,\text{'}\lambda\text{'}\,I\,E\,\text{')'}\} + q_3H\{\text{'('}\,E\,E\,\text{')'}\}.$$

Tokens can be ignored in computing negentropies, so this reduces to

$$\begin{aligned}
H(E) &= H(q_1,q_2,q_3) + q_1H(I) + q_2[H(I) + H(E)] + q_3 \cdot 2H(E) \\
&= H(q_1,q_2,q_3) + (q_1+q_2)H(I) + (q_2 + 2q_3)H(E).
\end{aligned}$$

Solving for $H(E)$ we have

$$H(E) = \frac{H(q_1,q_2,q_3) + (q_1+q_2)H(I)}{1 - q_2 - 2q_3}.$$

It remains to solve for $H(I)$. We apply the formula for $H(A^{p+})$ to get

$$\begin{aligned}
H(I) &= H\{(p_1a \oplus \cdots \oplus p_{36}9)^{p+}\} \\
&= [H(p,\bar{p}) + H\{p_1a \oplus \cdots \oplus p_{36}9\}]/p \\
&= [H(p,\bar{p}) + H(p_1, p_2, \ldots, p_{36})]/p.
\end{aligned}$$

The resulting formula for the negentropy of the lambda-calculus is:

$$H(E) = \frac{H(q_1,q_2,q_3) + (q_1+q_2)[H(p,\bar{p}) + H(p_1, \ldots, p_{36})]/p}{1 - q_2 - 2q_3}.$$

To actually compute the negentropy it is, of course, necessary to determine the production
probabilities $p$, $q_1$, $q_2$, $q_3$, $p_1$, $p_2$, $\ldots$, $p_{36}$. Since all the probabilities associated
in an alternation must add to unity, there are just 38 independent probabilities to be
determined.


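To determine these probabilities we will calculate the measurable properties of the
strings in the language: the average string length and the occurrence densities of the
tokens. The average length is:

$$\begin{aligned}
\lambda &= \Lambda(E) \\
&= q_1\Lambda(I) + q_2\Lambda\{\text{'('}\,\text{'}\lambda\text{'}\,I\,E\,\text{')'}\} + q_3\Lambda\{\text{'('}\,E\,E\,\text{')'}\} \\
&= q_1\Lambda(I) + q_2[\Lambda\{\text{'('}\} + \Lambda\{\text{'}\lambda\text{'}\} + \Lambda(I) + \lambda + \Lambda\{\text{')'}\}] + q_3[\Lambda\{\text{'('}\} + 2\lambda + \Lambda\{\text{')'}\}] \\
&= q_1\Lambda(I) + q_2[\Lambda(I) + \lambda + 3] + q_3[2\lambda + 2] \\
&= (q_1+q_2)\Lambda(I) + (q_2+2q_3)\lambda + 3q_2 + 2q_3.
\end{aligned}$$

Solving for $\lambda$ we have

$$\lambda = \frac{(q_1+q_2)\Lambda(I) + 3q_2 + 2q_3}{1 - q_2 - 2q_3}.$$

It remains to compute $\Lambda(I)$:

$$\begin{aligned}
\Lambda(I) &= \Lambda\{(p_1a \oplus \cdots \oplus p_{36}9)^{p+}\} \\
&= \Lambda\{p_1a \oplus \cdots \oplus p_{36}9\}/p \\
&= (p_1 + \cdots + p_{36})/p \\
&= 1/p \text{ tokens}.
\end{aligned}$$

Substituting this result into the formula for $\lambda$ gives the average length of a string in the
lambda-calculus:

$$\lambda = \frac{(q_1+q_2)/p + 3q_2 + 2q_3}{1 - q_2 - 2q_3} \text{ tokens}.$$

Recalling that the information density of a language is the ratio of its entropy and average
length, we have

$$\eta = \frac{H(q_1,q_2,q_3) + (q_1+q_2)[H(p,\bar{p}) + H(p_1, \ldots, p_{36})]/p}{(q_1+q_2)/p + 3q_2 + 2q_3} \text{ bits/token}.$$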

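It remains to compute the frequencies of occurrence of the tokens in the language.
First we compute $\varphi_{lp}$, the average number of left parentheses in a string.

$$\begin{aligned}
\varphi_{lp} &= \Phi_{lp}(E) \\
&= q_1\Phi_{lp}(I) + q_2\Phi_{lp}\{\text{'('}\,\text{'}\lambda\text{'}\,I\,E\,\text{')'}\} + q_3\Phi_{lp}\{\text{'('}\,E\,E\,\text{')'}\} \\
&= q_1 \cdot 0 + q_2(1 + \varphi_{lp}) + q_3(1 + \varphi_{lp} + \varphi_{lp}) \\
&= q_2 + q_3 + (q_2 + 2q_3)\varphi_{lp}.
\end{aligned}$$

Solving for $\varphi_{lp}$ we have

$$\varphi_{lp} = \frac{q_2 + q_3}{1 - q_2 - 2q_3}.$$

Clearly, $\varphi_{rp} = \varphi_{lp}$. It is also easy to compute the corresponding densities:

$$\delta_{lp} = \delta_{rp} = \varphi_{lp}/\lambda = \varphi_{rp}/\lambda.$$

Next we compute $\varphi_\lambda$, the frequency of the token '$\lambda$', by the same process:

$$\begin{aligned}
\varphi_\lambda &= \Phi_\lambda(E) \\
&= q_1\Phi_\lambda(I) + q_2\Phi_\lambda\{\text{'('}\,\text{'}\lambda\text{'}\,I\,E\,\text{')'}\} + q_3\Phi_\lambda\{\text{'('}\,E\,E\,\text{')'}\} \\
&= q_2[\Phi_\lambda(\text{'}\lambda\text{'}) + \Phi_\lambda(E)] + 2q_3\Phi_\lambda(E) \\
&= q_2(1 + \varphi_\lambda) + 2q_3\varphi_\lambda.
\end{aligned}$$

Solving for $\varphi_\lambda$ we have

$$\varphi_\lambda = \frac{q_2}{1 - q_2 - 2q_3},$$

and $\delta_\lambda = \varphi_\lambda/\lambda$.

Finally, we solve for $\varphi_i$, the frequency of the $i$-th alphanumeric character:

$$\begin{aligned}
\varphi_i &= \Phi_i(E) \\
&= q_1\Phi_i(I) + q_2[\Phi_i(I) + \Phi_i(E)] + 2q_3\Phi_i(E) \\
&= (q_1+q_2)\Phi_i(I) + (q_2+2q_3)\varphi_i.
\end{aligned}$$

Solving for $\varphi_i$ we have

$$\varphi_i = \frac{(q_1 + q_2)\Phi_i(I)}{1 - q_2 - 2q_3}.$$

It remains to determine $\Phi_i(I)$:

$$\begin{aligned}
\Phi_i(I) &= \Phi_i\{(p_1a \oplus \cdots \oplus p_{36}9)^{p+}\} \\
&= \Phi_i\{p_1a \oplus \cdots \oplus p_{36}9\}/p \\
&= [p_1\Phi_i(a) + \cdots + p_{36}\Phi_i(9)]/p \\
&= p_i/p.
\end{aligned}$$

Substituting this into the equation for $\varphi_i$ yields

$$\varphi_i = \frac{(q_1 + q_2)\,p_i}{(1 - q_2 - 2q_3)\,p}.$$

As usual, $\delta_i = \varphi_i/\lambda$.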

Unfortunately,  the  equations  derived  for  the  lambda-calculus  are  quite  difficult  to 
solve.  In  practice  they  would  probably  have  to  be  solved  numerically. 

We can gain some insight into these equations by considering their behavior in some
typical situations. Therefore, suppose that all the identifier characters are equally
likely:

$$p_1 = p_2 = \cdots = p_{36} = 1/36.$$

Then we have

$$\varphi_i = \frac{q_1 + q_2}{36p(1 - q_2 - 2q_3)}.$$

Now let $a = 1 - q_2 - 2q_3$. Then we have

$$\lambda = [(q_1+q_2)/p + 3q_2 + 2q_3]/a,$$

$$\varphi_\lambda = q_2/a,$$

$$\varphi_{lp} = \varphi_{rp} = (q_2+q_3)/a,$$

$$\varphi_i = (q_1+q_2)/(36pa).$$

Rewriting the first equation:

$$\begin{aligned}
\lambda &= [(q_1+q_2)/pa] + (3q_2 + 2q_3)/a \\
&= 36\varphi_i + (3q_2 + 2q_3)/a \\
&= 36\varphi_i + q_2/a + (2q_2 + 2q_3)/a \\
&= 36\varphi_i + \varphi_\lambda + 2\varphi_{lp}.
\end{aligned}$$

If we rewrite this

$$\lambda = 36\varphi_i + \varphi_\lambda + \varphi_{lp} + \varphi_{rp},$$

then it becomes intuitive: the average length of a string is the sum of the average
frequencies of occurrence of each terminal symbol.
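
The forward direction of these equations, from production probabilities to measurable
properties, is straightforward to evaluate. The sketch below is an illustration with a
hypothetical function name and made-up probability values; it also checks that the
average length equals the sum of the token frequencies. Note that $1 - q_2 - 2q_3 = q_1 - q_3$,
so the formulas give finite positive values only when $q_1 > q_3$.

```python
def lambda_calculus_stats(q1, q2, q3, p, char_probs):
    """Measurable properties predicted by the annotated lambda-calculus grammar.

    q1, q2, q3 : probabilities of the three alternatives for E (sum to 1)
    p          : annotation of the Kleene cross in I (average identifier length 1/p)
    char_probs : probabilities of the identifier characters (sum to 1)
    Returns (avg_length, phi_paren, phi_lambda, phi_chars), where phi_paren is
    the frequency of '(' (equal to that of ')').
    """
    a = 1.0 - q2 - 2.0 * q3
    if a <= 0.0:
        raise ValueError("need q1 > q3 for a finite average length")
    avg_length = ((q1 + q2) / p + 3.0 * q2 + 2.0 * q3) / a
    phi_paren = (q2 + q3) / a
    phi_lambda = q2 / a
    phi_chars = [(q1 + q2) * pi / (a * p) for pi in char_probs]
    return avg_length, phi_paren, phi_lambda, phi_chars


# Made-up probabilities, with 36 equally likely identifier characters.
lam, phi_lp, phi_lam, phi_chars = lambda_calculus_stats(
    0.6, 0.2, 0.2, 0.5, [1.0 / 36.0] * 36)
total = sum(phi_chars) + phi_lam + 2.0 * phi_lp
```

With these values the average length is 6.5 tokens, made up of 4 identifier characters,
0.5 lambdas and 1 parenthesis of each kind per string on average. Recovering the
probabilities from measured values is the inverse problem, which this sketch does not
attempt.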


Finally, we derive the information consumed per production by this grammar. To do
this we expand the Kleene cross:

$$I \to pA \oplus \bar{p}AI$$
$$A \to p_1a \oplus \cdots \oplus p_{36}9$$

and compute its average derivation length:

$$D(I) = 1 + pD(A) + \bar{p}D(A) + \bar{p}D(I)$$

$$D(A) = 1$$

Therefore, $D(I) = 2 + \bar{p}D(I)$, so $D(I) = 2/p$. Next we compute $D(E)$:

$$\begin{aligned}
D(E) &= 1 + q_1D(I) + q_2[D(I) + D(E)] + q_3[2D(E)] \\
&= 1 + 2(q_1+q_2)/p + (q_2+2q_3)D(E).
\end{aligned}$$

Therefore,

$$D(E) = \frac{1 + 2(q_1+q_2)/p}{1 - q_2 - 2q_3} \text{ productions}.$$

The average information used per production is the ratio $H(E)/D(E)$, which is

$$Q = \frac{H(q_1,q_2,q_3) + (q_1+q_2)[H(p,\bar{p}) + H(p_1, \ldots, p_{36})]/p}{1 + 2(q_1+q_2)/p} \text{ bits/production}.$$
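
For concrete probabilities these quantities can be evaluated directly. The following
sketch uses made-up probability values and a hypothetical function name; `entropy` is
the ordinary positive Shannon entropy in bits, and $D(I) = 2/p$ is taken from the
derivation above.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in probs if x > 0)


def lambda_calculus_information(q, p, char_probs):
    """Return (H(E), D(E), Q) for the annotated lambda-calculus grammar.

    q          : (q1, q2, q3), probabilities of the alternatives for E
    p          : annotation of the Kleene cross in I
    char_probs : probabilities of the identifier characters
    """
    q1, q2, q3 = q
    a = 1.0 - q2 - 2.0 * q3
    if a <= 0.0:
        raise ValueError("need q1 > q3 for finite entropy and derivation length")
    h_i = (entropy([p, 1.0 - p]) + entropy(char_probs)) / p   # H(I)
    h_e = (entropy(q) + (q1 + q2) * h_i) / a                  # H(E)
    d_i = 2.0 / p                                             # D(I)
    d_e = (1.0 + (q1 + q2) * d_i) / a                         # D(E)
    return h_e, d_e, h_e / d_e


# Made-up probabilities, with 36 equally likely identifier characters.
h_e, d_e, q_bits = lambda_calculus_information(
    (0.5, 0.25, 0.25), 0.5, [1.0 / 36.0] * 36)
```

With these values the average derivation has 16 productions; the information used per
production is then simply `h_e / d_e`.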


7.  Conclusions 

We  have  described  means  for  computing  a  number  of  information-theoretic  properties 
of  languages  and  their  grammars.  These  properties  include,  for  languages,  their 
entropy,  average  string  length,  information  density  and  density  of  occurrence  for  a 
given  token.  For  grammars  we  have  shown  how  to  compute  average  derivation  length 
and  the  information  used  by  the  grammar  per  production. 

All of these techniques are based on the application of simple recursive formulas to
annotated grammars, grammars annotated with production probabilities. It appears
that the same techniques can be applied to the computation of many other properties
of both grammars and other symbol systems.